


Nemotron Model Family

NVIDIA Nemotron™ is a family of open models, datasets, and training recipes engineered for high performance, efficiency, and customization. Nemotron models support synthetic data workflows and supervised fine-tuning, and are optimized for real-time inference, reasoning agents, and production AI systems.

Featured Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for high compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

Price per 1M input tokens: $0.10
Price per 1M cached input tokens: $0.04
Price per 1M output tokens: $0.50
Release Date: 03/10/2026
Context Size: 262,144 tokens
Quantization: bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/Nemotron-3-Nano-30B-A3B

NVIDIA Nemotron 3 Nano is a small open reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built on a hybrid Mixture-of-Experts (MoE) Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. In real-world AI systems, reasoning can generate significantly more tokens per prompt; Nemotron Nano reduces compute cost on these workloads while maintaining strong reasoning quality.

Price per 1M input tokens: $0.05
Price per 1M output tokens: $0.20
Release Date: 12/15/2025
Context Size: 262,144 tokens
Quantization: fp4


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL

NVIDIA Nemotron Nano v2 VL extends the Nemotron family into multi-modal reasoning and document intelligence. This auto-regressive vision-language model enables multi-image reasoning, video understanding, visual Q&A, and document analysis and summarization. Optimized for enterprise AI workflows, it powers multimodal agentic systems such as visual copilots, document assistants, and knowledge automation pipelines.

Price per 1M input tokens: $0.20
Price per 1M output tokens: $0.60
Release Date: 10/28/2025
Context Size: 131,072 tokens
Quantization: fp8


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
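Because this is a vision-language model, requests can also carry images using the OpenAI-compatible multimodal message format. A minimal sketch of such a payload (the image URL here is a placeholder, not a real asset):

```python
# Hypothetical sketch: a multimodal message for the VL model in the
# OpenAI-compatible chat format. Replace the placeholder URL with your
# own image URL (or a base64 data URL).
image_url = "https://example.com/invoice.png"  # placeholder assumption

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this document."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

# This list plugs into the same client call shown above, e.g.:
# chat_completion = openai.chat.completions.create(
#     model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL",
#     messages=messages,
# )
```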

Available Nemotron Models

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

  • Nano for maximum efficiency and stable inference
  • Super for multi-agent systems and advanced reasoning
  • Instruct variants for instruction-following and conversational workloads
Model                               Context  $ per 1M input tokens  $ per 1M output tokens
NVIDIA-Nemotron-3-Super-120B-A12B   256k     $0.10 ($0.04 cached)   $0.50
Nemotron-3-Nano-30B-A3B             256k     $0.05                  $0.20
NVIDIA-Nemotron-Nano-12B-v2-VL      128k     $0.20                  $0.60
Llama-3.1-Nemotron-70B-Instruct     128k     $1.20                  $1.20
Llama-3.3-Nemotron-Super-49B-v1.5   128k     $0.10                  $0.40
NVIDIA-Nemotron-Nano-9B-v2          128k     $0.04                  $0.16

FAQ

How do I integrate Nemotron models into my application?

You can integrate Nemotron models using DeepInfra's OpenAI-compatible API. Just replace your existing base URL with DeepInfra's endpoint and use your DeepInfra API key; no infrastructure setup is required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads.

What are the pricing details for using Nemotron models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.04 and $1.20 per million
  • Output Tokens: between $0.16 and $1.20 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.
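Since billing is purely per token, the cost of a single request can be computed directly from the usage fields the API returns. An illustrative sketch using the Nemotron 3 Super rates listed above (the token counts are taken from the example snippet):

```python
# Illustrative cost calculation from per-million-token rates.
# Rates are the Nemotron 3 Super prices listed on this page.
INPUT_RATE = 0.10   # $ per 1M input tokens
OUTPUT_RATE = 0.50  # $ per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the dollar cost of one request at the rates above."""
    return (prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE) / 1_000_000

# Example: the usage numbers from the snippet above (11 in, 25 out).
print(f"${request_cost(11, 25):.8f}")  # → $0.00001360
```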

How do I get started using Nemotron on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Nemotron models, without any infrastructure setup.