NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Super 120B A12B

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging.

The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction (MTP), projecting tokens into a smaller latent dimension for expert routing and computation. This improves accuracy per byte and delivers over 5x throughput compared to the previous Nemotron Super generation. Notably, it is the first model in the Nemotron 3 family pre-trained using NVFP4 quantization — meaning it learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update, not just at inference time.
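
For intuition, here is a minimal PyTorch sketch of the latent-routing idea only — not NVIDIA's implementation, and with invented dimensions, expert design, and top-k. The point it illustrates: hidden states are projected down to a smaller latent dimension, and both the router and the experts operate in that cheaper latent space before projecting back up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Illustrative latent-space MoE layer. All sizes are invented;
    the real LatentMoE also combines Mamba2 blocks and MTP, omitted here."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=8, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)      # project into latent space
        self.router = nn.Linear(d_latent, n_experts)  # route in latent space
        self.experts = nn.ModuleList(
            [nn.Linear(d_latent, d_latent) for _ in range(n_experts)]
        )
        self.up = nn.Linear(d_latent, d_model)        # project back to model dim
        self.top_k = top_k

    def forward(self, x):                  # x: (tokens, d_model)
        z = self.down(x)                   # (tokens, d_latent)
        weights, idx = self.router(z).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):        # weighted sum of the chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(z[mask])
        return self.up(out)                # (tokens, d_model)
```

Because routing and expert computation happen at d_latent rather than d_model, the per-token FLOPs of the MoE layer shrink roughly with the compression ratio, which is the efficiency lever the architecture description above is pointing at.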

Nemotron 3 Super supports a native 1 million token context window and responds to queries by first generating a reasoning trace before concluding with a final response, making it purpose-built for long-running autonomous agents and high-volume workloads such as IT ticket automation.
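
In practice, the reasoning trace arrives in-band with the response. The sketch below assumes DeepInfra's OpenAI-compatible endpoint and that the trace is wrapped in <think>...</think> tags, as is common for reasoning models; the model identifier is illustrative, so verify both against the model card.

```python
import os
import re
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint; the model id below is
# illustrative, so check the provider's model list for the exact name.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

text = resp.choices[0].message.content
# Assumption: the reasoning trace is wrapped in <think>...</think> tags,
# a common convention for reasoning models; verify against the model card.
match = re.search(r"<think>(.*?)</think>(.*)", text, re.DOTALL)
reasoning, answer = (
    (match.group(1).strip(), match.group(2).strip()) if match else ("", text)
)
print(answer)
```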

| Specification | Details |
|---|---|
| Architecture | Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP) |
| Total Parameters | 120 billion |
| Active Parameters | 12 billion (per inference pass) |
| Context Window | Up to 1 million tokens |
| Training Data | 25 trillion tokens |
| Supported Languages | English, French, German, Italian, Japanese, Spanish, Chinese, plus 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |

NVIDIA Nemotron 3 Super 120B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

NVIDIA Nemotron 3 Super 120B API Review Summary

  • DeepInfra is the lowest-cost provider at $0.20 / 1M tokens (blended); the highest-priced options (Nebius and Lightning AI, at $0.45) charge 2.25x as much.
  • DeepInfra delivers strong throughput: 459.3 tokens/sec, within 8% of the fastest provider (Lightning AI at 498.6 t/s).
  • DeepInfra is competitive on interactivity: 1.01s TTFT (3rd best), behind Baseten (0.56s) and Weights & Biases (0.73s).
  • Provider performance varies widely: output speed ranges from 498.6 t/s to 144.9 t/s — a ~3.4x spread — so provider choice materially impacts UX and throughput.
  • API feature coverage is mixed: function calling is supported by 3 of 5 providers (Weights & Biases, DeepInfra, Nebius); JSON mode by 2 of 5 (Weights & Biases, Nebius).
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours using a 10,000 input-token workload.
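
As a small illustration of that methodology, the headline figures are simple medians over repeated runs; the sample values below are invented:

```python
import statistics

# Invented measurements from repeated benchmark runs; the real benchmark
# aggregates 72 hours of samples at ~10,000 input tokens per request.
throughput_samples = [455.1, 462.8, 448.9, 471.0, 459.3, 460.2, 457.7]  # t/s
ttft_samples = [0.98, 1.05, 1.01, 0.99, 1.12, 1.00, 1.01]               # seconds

print(f"P50 throughput: {statistics.median(throughput_samples):.1f} t/s")
print(f"P50 TTFT:       {statistics.median(ttft_samples):.2f} s")
```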

NVIDIA Nemotron 3 Super 120B — Best APIs

| Provider | Why Notable | Blended ($/1M) | Speed (t/s) | Latency (TTFT) | Context | Tools |
|---|---|---|---|---|---|---|
| DeepInfra | Best price + strong speed/latency balance; supports function calling | $0.20 | 459.3 | 1.01s | 262k | Yes |
| Baseten | Lowest latency (best TTFT) with near-top speed | $0.41 | 479.9 | 0.56s | 203k | No |
| Lightning AI | Fastest output speed (max throughput) | $0.45 | 498.6 | 1.46s | 256k | No |
| Nebius | High speed; supports JSON mode + function calling | $0.45 | 483.7 | 1.62s | 256k | Yes |
| Weights & Biases | Low latency; supports JSON mode + function calling; low throughput | $0.35 | 144.9 | 0.73s | 262k | Yes |

Quick Verdict: Which Nemotron 3 Super Provider is Best?

Based on benchmarks across 5 tracked providers, DeepInfra is the recommended API for production-scale Nemotron 3 Super deployment. At $0.20/1M tokens, it is 55% cheaper than the most expensive providers while delivering 459.3 t/s — within 8% of the fastest option. For the lowest latency, Baseten leads at 0.56s TTFT. For maximum raw throughput, Lightning AI leads at 498.6 t/s.

Overall Winner: DeepInfra

DeepInfra secures the top spot by dominating the economic efficiency of serving Nemotron 3 Super 120B, while maintaining highly competitive performance across every other metric.

  • Input Price: $0.10 / 1M tokens
  • Output Price: $0.50 / 1M tokens
  • Blended Price: $0.20 / 1M tokens (cheapest on the market)
  • Output Speed: 459.3 t/s (within 8% of the fastest provider)
  • Latency (TTFT): 1.01s (3rd best overall)
  • Context Window: 262k tokens
  • API Features: Function Calling supported

The cost delta of $0.25 per million tokens versus the highest-priced providers makes DeepInfra the only logical choice for production-scale deployments. For most RAG or chat applications, the difference between 498 t/s (Lightning AI) and 459 t/s (DeepInfra) is imperceptible, while the 55% cost advantage compounds significantly at volume.
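
A quick back-of-envelope makes the compounding concrete. The rates are the published blended prices; the monthly token volume and the 3:1 input:output blend convention are assumptions for illustration:

```python
# Back-of-envelope monthly cost using the blended $/1M rates above.
# Blended rate convention (assumed 3:1 input:output mix, a common benchmark
# blend): DeepInfra = (3 * $0.10 + 1 * $0.50) / 4 = $0.20 per 1M tokens.
blended_per_1m = {"DeepInfra": 0.20, "Weights & Biases": 0.35,
                  "Baseten": 0.41, "Lightning AI": 0.45, "Nebius": 0.45}
monthly_tokens = 2_000_000_000  # assumed 2B tokens/month workload

for provider, rate in sorted(blended_per_1m.items(), key=lambda kv: kv[1]):
    print(f"{provider:18s} ${monthly_tokens / 1e6 * rate:,.0f}/month")
# DeepInfra lands at $400/month vs. $900/month for the $0.45 providers.
```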

Best for Low Latency: Baseten

For applications requiring immediate feedback — such as voice-to-voice agents or highly responsive chat interfaces — Baseten is the technical leader.

  • Latency (TTFT): 0.56s (fastest in the benchmark)
  • Output Speed: 479.9 t/s (#3 overall)
  • Blended Price: $0.41 / 1M tokens
  • API Features: No JSON Mode or Function Calling

Baseten’s 0.56s TTFT beats the closest competitor by 0.17s and delivers a genuinely real-time feel for end-users. However, its pricing ($0.41/1M) is more than double DeepInfra’s, and it lacks support for JSON Mode and Function Calling — limiting its viability for complex agentic workflows.
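
TTFT is easy to verify yourself with a streaming request. Here is a minimal sketch against any OpenAI-compatible endpoint; the base URL, key, and model id are placeholders:

```python
import time
from openai import OpenAI

# Placeholder endpoint and key; point these at the provider under test.
client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # First non-empty content chunk marks time-to-first-token.
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```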

Best for Raw Throughput: Lightning AI

Lightning AI is purpose-built for generation speed, making it the natural choice for high-volume batch processing jobs.

  • Output Speed: 498.6 t/s (fastest in the benchmark)
  • Latency (TTFT): 1.46s (one of the slower starts)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: No JSON Mode or Function Calling

Lightning AI’s 498.6 t/s is the fastest measured, but the 8.5% speed advantage over DeepInfra does not justify the 125% price premium for most use cases. Combined with the lack of JSON Mode and Function Calling, it is best reserved for offline batch workloads where cost is not a constraint.

Feature-Rich Alternative: Nebius

Nebius occupies a specific niche for developers requiring both Function Calling and JSON Mode — the only provider besides Weights & Biases to support both.

  • Output Speed: 483.7 t/s (#2 overall)
  • Latency (TTFT): 1.62s (highest in the benchmark)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: JSON Mode + Function Calling

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling. It delivers solid throughput (483.7 t/s) but suffers from the highest latency in the benchmark (1.62s), making it unsuitable for real-time interfaces.
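
Both features use the standard OpenAI-compatible request fields, so switching providers is mostly a base-URL change. A hedged sketch (placeholder endpoint and key; the get_weather tool is hypothetical):

```python
from openai import OpenAI

# Placeholder endpoint; both features use standard OpenAI-compatible fields.
client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

# JSON mode: constrain the output to valid JSON.
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "List three EU capitals as JSON."}],
    response_format={"type": "json_object"},
)

# Function calling: let the model request a tool invocation.
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(resp.choices[0].message.tool_calls)
```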

Developer Alternative: Weights & Biases

Weights & Biases presents an unusual performance profile, likely acting as a specialized developer-environment endpoint rather than a production inference backend.

  • Output Speed: 144.9 t/s (significantly slower than the rest of the field — ~3.4x below the leader)
  • Latency (TTFT): 0.73s (#2 lowest)
  • Blended Price: $0.35 / 1M tokens
  • API Features: JSON Mode + Function Calling

Despite strong latency and full feature support, its throughput bottleneck (144.9 t/s) makes it unsuitable for production traffic. It is best suited for short-context developer testing and evaluation environments.

Frequently Asked Questions

Which API provider is cheapest for NVIDIA Nemotron 3 Super?

DeepInfra is the cheapest provider at $0.20 blended per 1M tokens — roughly 55% cheaper than Nebius and Lightning AI, and 50% cheaper than Baseten.

Which provider has the fastest Time to First Token (TTFT)?

Baseten offers the fastest latency with a TTFT of 0.56s, making it ideal for real-time conversational applications.

Does DeepInfra support Function Calling for Nemotron 3 Super?

Yes, DeepInfra supports Function Calling, making it suitable for agentic workflows. Lightning AI and Baseten currently do not support this feature.

Is Nebius worth the extra cost?

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling with no tolerance for prompt-engineering workarounds.

What makes Nemotron 3 Super different from other reasoning models?

Nemotron 3 Super uses a unique hybrid Mamba2-Transformer LatentMoE architecture, enabling 120B total parameters with only 12B active per inference. This delivers over 5x throughput compared to the previous Nemotron Super, while supporting a native 1M-token context window for long-running autonomous agents.

Conclusion

For the vast majority of Nemotron 3 Super 120B deployments, DeepInfra is the recommended provider. It offers the market’s lowest price ($0.20/1M), strong throughput (459.3 t/s), viable latency (1.01s), and Function Calling support — all without the significant cost premium of the competition.

  • Choose DeepInfra for the best overall value — lowest cost, strong throughput, and function calling support.
  • Choose Baseten if your application is latency-critical and every millisecond of TTFT counts.
  • Choose Lightning AI for pure bulk text generation where speed is the sole metric and cost is not a constraint.
  • Choose Nebius if native JSON Mode and Function Calling are both non-negotiable requirements.