NVIDIA Nemotron 3 Super - blazing-fast agentic AI, ready to deploy today!

NVIDIA Nemotron™ is a family of open models, datasets, and training recipes engineered for high performance, efficiency, and customization. Nemotron models support synthetic data workflows and supervised fine-tuning — and are equally optimized for real-time inference, reasoning agents, and production AI systems.
NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for the highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.
- Price per 1M input tokens: $0.10
- Price per 1M cached input tokens: $0.04
- Price per 1M output tokens: $0.50
- Release Date: 03/10/2026
- Context Size: 262,144 tokens
- Quantization: bfloat16
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
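The listed prices make per-request cost easy to estimate from the usage counts returned by the API. The sketch below is illustrative — the `estimate_cost` helper is hypothetical, not part of any SDK — using the Nemotron 3 Super rates on this page: $0.10 per 1M input tokens, $0.04 per 1M cached input tokens, $0.50 per 1M output tokens.

```python
# Hypothetical helper (not part of the OpenAI SDK or DeepInfra API) that
# estimates request cost in dollars from usage counts, using the
# Nemotron 3 Super prices listed on this page.
PRICE_INPUT = 0.10   # $ per 1M uncached input tokens
PRICE_CACHED = 0.04  # $ per 1M cached input tokens
PRICE_OUTPUT = 0.50  # $ per 1M output tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int, cached_tokens: int = 0) -> float:
    uncached = prompt_tokens - cached_tokens
    return (uncached * PRICE_INPUT
            + cached_tokens * PRICE_CACHED
            + completion_tokens * PRICE_OUTPUT) / 1_000_000

# The sample response above reported 11 prompt tokens and 25 completion tokens:
print(f"${estimate_cost(11, 25):.8f}")  # $0.00001360
```

Cached input tokens are billed at 40% of the uncached rate, so prompts with a large shared prefix (system prompts, tool schemas) get meaningfully cheaper on repeat calls.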
NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.
- Price per 1M input tokens: $0.05
- Price per 1M output tokens: $0.20
- Release Date: 12/15/2025
- Context Size: 262,144 tokens
- Quantization: fp4
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
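Because reasoning traces can expand a short prompt into thousands of output tokens, the output rate tends to dominate per-request cost. A back-of-the-envelope sketch using the Nano prices above — the 500-in / 4,000-out workload is an assumed example, not a measured trace:

```python
# Illustrative arithmetic: for reasoning workloads, output tokens dominate cost.
# Prices are the Nemotron 3 Nano rates listed above; the token counts are
# an assumed example workload.
PRICE_IN, PRICE_OUT = 0.05, 0.20  # $ per 1M tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000

input_share = 500 * PRICE_IN / 1_000_000
output_share = 4_000 * PRICE_OUT / 1_000_000
total = request_cost(500, 4_000)
print(f"input: ${input_share:.6f}  output: ${output_share:.6f}  total: ${total:.6f}")
# input: $0.000025  output: $0.000800  total: $0.000825
```

In this assumed workload, output tokens account for roughly 97% of the bill, which is why a low output price matters most for reasoning-heavy agents.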
NVIDIA Nemotron 2 Nano VL extends the Nemotron family into multi-modal reasoning and document intelligence. This auto-regressive vision-language model enables multi-image reasoning, video understanding, visual Q&A, and document analysis and summarization. Optimized for enterprise AI workflows, it powers multimodal agentic systems such as visual copilots, document assistants, and knowledge automation pipelines.
- Price per 1M input tokens: $0.20
- Price per 1M output tokens: $0.60
- Release Date: 10/28/2025
- Context Size: 131,072 tokens
- Quantization: fp8
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
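For the VL model, images go into a message as OpenAI-style content parts (a text part plus an `image_url` part). The sketch below only builds the payload — the image URL is a placeholder — which you would pass as `messages=` to `chat.completions.create` with the model ID above; which image sources and formats are accepted (e.g. `data:` URLs) is an assumption to verify against the provider's docs.

```python
# A minimal vision payload in the standard OpenAI content-parts format:
# one text part plus one image_url part. The URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this document page."},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/page.png"},
            },
        ],
    }
]

# Sanity-check the shape before sending it:
parts = messages[0]["content"]
print([p["type"] for p in parts])  # ['text', 'image_url']
```

Multi-image reasoning, as described above, would simply append additional `image_url` parts to the same content list.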
The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.
| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| NVIDIA-Nemotron-3-Super-120B-A12B | 256k | $0.10 / $0.04 cached | $0.50 |
| Nemotron-3-Nano-30B-A3B | 256k | $0.05 | $0.20 |
| NVIDIA-Nemotron-Nano-12B-v2-VL | 128k | $0.20 | $0.60 |
| Llama-3.1-Nemotron-70B-Instruct | 128k | $1.20 | $1.20 |
| Llama-3.3-Nemotron-Super-49B-v1.5 | 128k | $0.10 | $0.40 |
| NVIDIA-Nemotron-Nano-9B-v2 | 128k | $0.04 | $0.16 |
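One way to use the table is a quick cost comparison for a projected workload. The helper below is hypothetical — prices are copied from the table, Super's cached-input discount is omitted for simplicity, and the 100M-in / 20M-out monthly volume is an assumed example:

```python
# Hypothetical cost-comparison helper built from the price table above.
# Values are (input, output) prices in $ per 1M tokens.
PRICES = {
    "NVIDIA-Nemotron-3-Super-120B-A12B": (0.10, 0.50),
    "Nemotron-3-Nano-30B-A3B": (0.05, 0.20),
    "NVIDIA-Nemotron-Nano-12B-v2-VL": (0.20, 0.60),
    "Llama-3.1-Nemotron-70B-Instruct": (1.20, 1.20),
    "Llama-3.3-Nemotron-Super-49B-v1.5": (0.10, 0.40),
    "NVIDIA-Nemotron-Nano-9B-v2": (0.04, 0.16),
}

def cost(model: str, input_tokens: float, output_tokens: float) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assumed monthly volume: 100M input + 20M output tokens.
for model in sorted(PRICES, key=lambda m: cost(m, 100e6, 20e6)):
    print(f"{model}: ${cost(model, 100e6, 20e6):,.2f}/month")
```

Accuracy, reasoning depth, and modality requirements then decide how far up the list a given workload needs to go.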
© 2026 Deep Infra. All rights reserved.