The Nemotron family is a group of large language models developed by NVIDIA, engineered to excel at generating high-quality synthetic data for training other AI models. Unlike models focused solely on end-user chat or content creation, Nemotron's core strength lies in producing diverse, realistic text-based training examples (question-answer pairs, instructions, and conversations) that are crucial for the supervised fine-tuning stage of AI development. By providing a robust toolkit for creating these datasets, Nemotron acts as a force multiplier in the AI training pipeline, letting developers build more capable, specialized models efficiently and at scale, without relying solely on scarce, human-curated data.
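Since synthetic-data generation is Nemotron's headline use case, the sketch below shows one way to request instruction/response pairs through DeepInfra's OpenAI-compatible endpoint. Only the base URL and model name come from this page; the prompt wording and the `build_sft_prompt` helper are illustrative assumptions, not a documented recipe.

```python
import os


def build_sft_prompt(topic: str, num_pairs: int) -> str:
    """Assemble an illustrative prompt asking the model for
    synthetic instruction/response pairs as JSON lines."""
    return (
        f"Generate {num_pairs} diverse instruction/response pairs about "
        f"{topic} for supervised fine-tuning. Return one JSON object per "
        'line with keys "instruction" and "response".'
    )


if os.environ.get("DEEPINFRA_TOKEN"):
    from openai import OpenAI  # pip install openai>=1.0.0

    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    completion = client.chat.completions.create(
        model="nvidia/Llama-3.3-Nemotron-Super-49B-v1.5",
        messages=[
            {"role": "user", "content": build_sft_prompt("Python packaging", 5)}
        ],
    )
    # Each line of the reply should be a JSON instruction/response pair.
    print(completion.choices[0].message.content)
```

In practice you would validate each generated line (parse the JSON, filter duplicates and low-quality pairs) before adding it to a fine-tuning dataset.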
Llama-3.3-Nemotron-Super-49B-v1.5 is a large language model (LLM) optimized for advanced reasoning, conversational interactions, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta's Llama-3.3-70B-Instruct, it employs a Neural Architecture Search (NAS) approach, significantly enhancing efficiency and reducing memory requirements.
Price per 1M input tokens
$0.10
Price per 1M output tokens
$0.40
Release Date
09/09/2025
Context Size
131,072
Quantization
fp8
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Llama-3.3-Nemotron-Super-49B-v1.5",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
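The description above lists tool-calling among Super-49B's optimized tasks. Below is a minimal sketch assuming the DeepInfra endpoint accepts the standard OpenAI `tools` parameter; the `get_weather` function schema is a made-up example, not a real API.

```python
import os

# Hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

if os.environ.get("DEEPINFRA_TOKEN"):
    from openai import OpenAI  # pip install openai>=1.0.0

    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    completion = client.chat.completions.create(
        model="nvidia/Llama-3.3-Nemotron-Super-49B-v1.5",
        messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
        tools=tools,
    )
    # If the model decides to call the tool, the request lands in
    # message.tool_calls rather than message.content.
    print(completion.choices[0].message.tool_calls)
```

Your application would execute the requested function locally, append the result as a `tool` message, and call the API again for the final answer.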
NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries by first generating a reasoning trace and then concluding with a final response. Reasoning is controlled via the system prompt: if the user prefers a final answer without intermediate reasoning traces, the model can be configured to skip them.
Price per 1M input tokens
$0.04
Price per 1M output tokens
$0.16
Release Date
09/09/2025
Context Size
131,072
Quantization
bfloat16
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
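The Nano-9B-v2 description says reasoning can be toggled through the system prompt. The sketch below does that with a system message; the exact directive text (`/think` vs. `/no_think` here) is an assumption drawn from NVIDIA's usage notes, so verify it against the model card before relying on it.

```python
import os


def reasoning_messages(user_prompt: str, reasoning: bool) -> list:
    """Build a chat payload that toggles the reasoning trace via the
    system prompt. The '/think' and '/no_think' directives are assumed,
    not confirmed by this page."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


if os.environ.get("DEEPINFRA_TOKEN"):
    from openai import OpenAI  # pip install openai>=1.0.0

    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    completion = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        # Request a direct answer with no intermediate reasoning trace.
        messages=reasoning_messages("Hello", reasoning=False),
    )
    print(completion.choices[0].message.content)
```

Skipping the reasoning trace reduces output tokens, which matters at the $0.16 per 1M output-token rate listed above when running at scale.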
NVIDIA Nemotron is a family of open models customized for efficiency, accuracy, and specialized workloads.
| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| Llama-3.1-Nemotron-70B-Instruct | 128k | $0.60 | $0.60 |
| Llama-3.3-Nemotron-Super-49B-v1.5 | 128k | $0.10 | $0.40 |
| NVIDIA-Nemotron-Nano-9B-v2 | 128k | $0.04 | $0.16 |
© 2025 Deep Infra. All rights reserved.