Use OpenAI API clients with LLaMas
Published on 2023.08.28 by Iskren Chernev

Getting started

# create a virtual environment
python3 -m venv .venv
# activate environment in current shell
. .venv/bin/activate
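# install the openai python client; the examples below use the pre-1.0 interface
# (openai.api_base, openai.ChatCompletion), which was removed in openai 1.0
pip install "openai<1.0"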

Choose a model

Run OpenAI chat.completion

import openai

stream = True # or False

# Point OpenAI client to our endpoint
openai.api_key = "<YOUR DEEPINFRA API KEY>"
openai.api_base = "https://api.deepinfra.com/v1/openai"

# Your chosen model here
MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
    # top_p=0.5,
)

if stream:
    # print the chat completion
    for event in chat_completion:
        print(event.choices)
else:
    print(chat_completion.choices[0].message.content)

Note that both streaming and batch mode are supported.
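
In streaming mode each event carries only a fragment of the reply, so you usually want to stitch the fragments back together. The sketch below is one way to do that, assuming each event exposes `choices[0].delta` with an optional `content` key, as the pre-1.0 OpenAI client does. The helper name `collect_stream_text` and the `fake_events` list are hypothetical and stand in for a real streamed response so the snippet runs without a network call.

```python
def collect_stream_text(events):
    """Join the text fragments of a streamed chat completion into one string."""
    parts = []
    for event in events:
        # Assumption: each event is dict-like with a "choices" list whose
        # first entry holds a "delta" that may or may not contain "content".
        delta = event["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Hypothetical events shaped like streamed chat completion chunks
fake_events = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " there"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

print(collect_stream_text(fake_events))
```

With a real request you would pass the `chat_completion` iterator from the example above in place of `fake_events`.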

Existing OpenAI integration

If you're already using OpenAI chat completion in your project, you only need to change the api_key, api_base and model params:

import openai

# set these before running any completions
openai.api_key = "YOUR DEEPINFRA TOKEN"
openai.api_base = "https://api.deepinfra.com/v1/openai"

openai.ChatCompletion.create(
    model="CHOSEN MODEL HERE",
    # ...
)

Pricing

Our OpenAI API compatible models are priced on token output (just like OpenAI). Our current price is $1 / 1M tokens.
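
As a rough illustration of that rate, the arithmetic can be sketched as follows. The function name `estimate_cost_usd` is hypothetical, and the snippet assumes you already know how many billed tokens a workload uses; it only applies the $1 per 1M tokens figure quoted above.

```python
# Price quoted in this post; treat it as an input that may change over time.
PRICE_PER_MILLION_TOKENS_USD = 1.0


def estimate_cost_usd(tokens, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate the cost in USD for a given number of billed tokens."""
    return tokens / 1_000_000 * price_per_million


# Example: 250,000 billed tokens at $1 per 1M tokens
print(estimate_cost_usd(250_000))  # 0.25
```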

Docs

Check the docs on the OpenAI API for more in-depth information and examples.
