


Nemotron Model Family

NVIDIA Nemotron™ is a family of open models, datasets, and training recipes engineered for high performance, efficiency, and customization. Nemotron models support synthetic data workflows and supervised fine-tuning, and are optimized for real-time inference, reasoning agents, and production AI systems.

Featured Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for high compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

Price per 1M input tokens: $0.10
Price per 1M cached input tokens: $0.04
Price per 1M output tokens: $0.50
Release Date: 03/10/2026
Context Size: 262,144 tokens
Quantization: bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/Nemotron-3-Nano-30B-A3B

NVIDIA Nemotron 3 Nano is a small open reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built on a hybrid Mixture-of-Experts (MoE) Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. In real-world AI systems, reasoning can generate significantly more tokens per prompt; Nemotron Nano reduces compute cost on these workloads while maintaining strong reasoning quality.

Price per 1M input tokens: $0.05
Price per 1M output tokens: $0.20
Release Date: 12/15/2025
Context Size: 262,144 tokens
Quantization: fp4


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL

NVIDIA Nemotron Nano v2 VL extends the Nemotron family into multi-modal reasoning and document intelligence. This auto-regressive vision-language model enables multi-image reasoning, video understanding, visual Q&A, and document analysis and summarization. Optimized for enterprise AI workflows, it powers multimodal agentic systems such as visual copilots, document assistants, and knowledge automation pipelines.

Price per 1M input tokens: $0.20
Price per 1M output tokens: $0.60
Release Date: 10/28/2025
Context Size: 131,072 tokens
Quantization: fp8


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
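Because this is a vision-language model, requests can also carry images using the OpenAI-compatible multimodal message format. A minimal sketch of such a payload (the image URL here is a placeholder, not a real asset):

```python
# Hypothetical sketch: a multimodal message for the VL model in the
# OpenAI-compatible chat format. Replace the placeholder URL with your
# own image URL (or a base64 data URL).
image_url = "https://example.com/invoice.png"  # placeholder assumption

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this document."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

# This list plugs into the same client call shown above, e.g.:
# chat_completion = openai.chat.completions.create(
#     model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL",
#     messages=messages,
# )
```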

Available Nemotron Models

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

  • Nano for maximum efficiency and stable inference
  • Super for multi-agent systems and advanced reasoning
  • Instruct variants for instruction-following and conversational workloads
Model                               Context  $ per 1M input tokens  $ per 1M output tokens
NVIDIA-Nemotron-3-Super-120B-A12B   256k     $0.10 ($0.04 cached)   $0.50
Nemotron-3-Nano-30B-A3B             256k     $0.05                  $0.20
NVIDIA-Nemotron-Nano-12B-v2-VL      128k     $0.20                  $0.60
Llama-3.1-Nemotron-70B-Instruct     128k     $1.20                  $1.20
Llama-3.3-Nemotron-Super-49B-v1.5   128k     $0.10                  $0.40
NVIDIA-Nemotron-Nano-9B-v2          128k     $0.04                  $0.16

FAQ

How do I integrate Nemotron models into my application?

You can integrate Nemotron models using DeepInfra's OpenAI-compatible API. Just replace your existing base URL with DeepInfra's endpoint and use your DeepInfra API key; no infrastructure setup is required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads.

What are the pricing details for using Nemotron models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.04 and $1.20 per million
  • Output Tokens: between $0.16 and $1.20 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.
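Since billing is purely per token, the cost of a single request can be computed directly from the usage fields the API returns. An illustrative sketch using the Nemotron 3 Super rates listed above (the token counts are taken from the example snippet):

```python
# Illustrative cost calculation from per-million-token rates.
# Rates are the Nemotron 3 Super prices listed on this page.
INPUT_RATE = 0.10   # $ per 1M input tokens
OUTPUT_RATE = 0.50  # $ per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the dollar cost of one request at the rates above."""
    return (prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE) / 1_000_000

# Example: the usage numbers from the snippet above (11 in, 25 out).
print(f"${request_cost(11, 25):.8f}")  # → $0.00001360
```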

How do I get started using Nemotron on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Nemotron models, without any infrastructure setup.