Use OpenAI API clients with LLaMas
Published on 2023.08.28 by Iskren Chernev

Getting started

# create a virtual environment
python3 -m venv .venv
# activate environment in current shell
. .venv/bin/activate
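# install openai python client (the examples below use the pre-1.0 interface,
# i.e. openai.ChatCompletion and openai.api_base, which 1.0 removed)
pip install "openai<1"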

Choose a model

Run OpenAI chat.completion

import openai

stream = True # or False

# Point OpenAI client to our endpoint
openai.api_key = "<YOUR DEEPINFRA API KEY>"
openai.api_base = "https://api.deepinfra.com/v1/openai"

# Your chosen model here
MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
    # top_p=0.5,
)

if stream:
    # print the chat completion
    for event in chat_completion:
        print(event.choices)
else:
    print(chat_completion.choices[0].message.content)

Note that both streaming and batch mode are supported.
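In streaming mode the reply arrives as a sequence of events rather than one message. The sketch below shows one way to join those events into the full reply text. It assumes each event follows the OpenAI chat-completion chunk shape (a `choices` list whose entries carry a `delta` with an optional `content` string). The helper name `collect_stream_text` and the `fake_events` list are made up for illustration, so the snippet runs without a network call.

```python
def collect_stream_text(events):
    """Join the text fragments of a streamed chat completion.

    Assumes every event looks like an OpenAI chat-completion chunk:
    {"choices": [{"index": 0, "delta": {"content": "..."}}]}
    Only the first choice of each event is read.
    """
    parts = []
    for event in events:
        choices = event["choices"]
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        piece = delta.get("content")
        if piece:  # role-only and final chunks carry no content
            parts.append(piece)
    return "".join(parts)


# Hypothetical events standing in for a real streamed response.
fake_events = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}}]},
    {"choices": [{"index": 0, "delta": {"content": " there"}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

print(collect_stream_text(fake_events))
```

With the pre-1.0 client, the object returned by a call made with stream=True could be passed to the helper in place of `fake_events`, since its events support the same key-based access.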

Existing OpenAI integration

If you're already using OpenAI chat completion in your project, you only need to change three things: the api_key, the api_base and the model parameter:

import openai

# set these before running any completions
openai.api_key = "YOUR DEEPINFRA TOKEN"
openai.api_base = "https://api.deepinfra.com/v1/openai"

openai.ChatCompletion.create(
    model="CHOSEN MODEL HERE",
    # ...
)

Pricing

Our OpenAI API compatible models are priced on token output (just like OpenAI). Our current price is $1 / 1M tokens.
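As a rough worked example, the arithmetic behind that price can be sketched as below. It assumes the quoted rate of $1 per 1M tokens applies to the tokens being billed. The function name `estimate_cost_usd` is hypothetical and not part of any API.

```python
PRICE_PER_MILLION_TOKENS_USD = 1.0  # rate quoted in this post


def estimate_cost_usd(tokens, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Return the estimated cost in USD for a given number of billed tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1_000_000 * price_per_million


# A reply capped at max_tokens=100, as in the example request above.
print(f"${estimate_cost_usd(100):.6f}")        # $0.000100
# One million tokens cost exactly the quoted rate.
print(f"${estimate_cost_usd(1_000_000):.2f}")  # $1.00
```

At this rate, a reply that uses the full 100-token cap costs about one hundredth of a cent.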

Docs

Check the OpenAI API docs for more in-depth information and examples.
