Use OpenAI API clients with LLaMas
Published on 2023.08.28 by Iskren Chernev

Getting started

# create a virtual environment
python3 -m venv .venv
# activate environment in current shell
. .venv/bin/activate
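# install openai python client (the examples below use the pre-1.0 interface,
# i.e. openai.ChatCompletion and openai.api_base, which 1.0 and later removed)
pip install "openai<1"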

Choose a model

Run OpenAI chat.completion

import openai

stream = True # or False

# Point OpenAI client to our endpoint
openai.api_key = "<YOUR DEEPINFRA API KEY>"
openai.api_base = "https://api.deepinfra.com/v1/openai"

# Your chosen model here
MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
    # top_p=0.5,
)

if stream:
    # print each streamed chunk as it arrives
    for event in chat_completion:
        print(event.choices)
else:
    print(chat_completion.choices[0].message.content)

Note that both streaming (stream = True) and non-streaming (stream = False) responses are supported.
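When streaming is enabled, each event carries only a fragment of the reply. Below is a minimal sketch of stitching those fragments back together. It assumes the events follow the OpenAI streaming chunk layout (choices[0].delta with an optional content key). `collect_stream` is a hypothetical helper name, and the sample events are hand-written stand-ins rather than real API output.

```python
def collect_stream(events):
    """Concatenate the text fragments from a stream of chat completion chunks."""
    parts = []
    for event in events:
        choices = event["choices"]
        if not choices:
            continue
        # The first chunk usually carries only the role; the last one may be empty.
        delta = choices[0].get("delta", {})
        content = delta.get("content")
        if content:
            parts.append(content)
    return "".join(parts)


# Hand-written events standing in for a streamed response (not real API output).
sample_events = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": " there"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

reply = collect_stream(sample_events)
print(reply)
```

With the openai.ChatCompletion interface used above, the value returned by create(..., stream=True) should be usable as `events` directly, since its response objects support dictionary-style access.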

Existing OpenAI integration

If you're already using OpenAI chat completion in your project, you only need to change the api_key, api_base and model parameters:

import openai

# set these before running any completions
openai.api_key = "YOUR DEEPINFRA TOKEN"
openai.api_base = "https://api.deepinfra.com/v1/openai"

openai.ChatCompletion.create(
    model="CHOSEN MODEL HERE",
    # ...
)
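One way to keep the switch in a single place is to read the settings from the environment and apply them at startup. This is a sketch only: the DEEPINFRA_API_KEY variable name and both helper functions are assumptions made for illustration, not something the DeepInfra docs prescribe.

```python
import os
from types import SimpleNamespace

DEEPINFRA_API_BASE = "https://api.deepinfra.com/v1/openai"


def deepinfra_settings(environ=os.environ):
    """Build the two module-level settings from the environment.

    DEEPINFRA_API_KEY is a hypothetical variable name chosen for this example.
    """
    api_key = environ.get("DEEPINFRA_API_KEY")
    if not api_key:
        raise RuntimeError("DEEPINFRA_API_KEY is not set")
    return {"api_key": api_key, "api_base": DEEPINFRA_API_BASE}


def apply_settings(client_module, settings):
    """Set api_key and api_base on the given module-like object."""
    for name, value in settings.items():
        setattr(client_module, name, value)
    return client_module


# Demonstration against a stand-in object; in a real project pass the imported
# openai module instead of SimpleNamespace().
fake_openai = apply_settings(
    SimpleNamespace(), deepinfra_settings({"DEEPINFRA_API_KEY": "example-token"})
)
print(fake_openai.api_base)
```

Keeping the token out of the source file also makes it easier to switch between providers by changing configuration rather than code.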

Pricing

Our OpenAI API compatible models are priced on token output (just like OpenAI). Our current price is $1 / 1M tokens.
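
As a rough worked example of that rate, the sketch below converts a token count into dollars. It assumes the flat $1 per 1M tokens quoted above still applies and that you already know how many tokens are billed. `estimate_cost_usd` is a hypothetical helper, not part of any client library.

```python
PRICE_PER_MILLION_TOKENS_USD = 1.0  # rate quoted in this post; may have changed


def estimate_cost_usd(billed_tokens, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Return the cost in USD for the given number of billed tokens."""
    if billed_tokens < 0:
        raise ValueError("billed_tokens must be non-negative")
    return billed_tokens * price_per_million / 1_000_000


# 100 tokens (the max_tokens used in the example request) and 250,000 tokens
print(estimate_cost_usd(100))
print(estimate_cost_usd(250_000))
```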

Docs

Check the OpenAI API docs for more in-depth information and examples.
