NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Super 120B A12B

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging.

The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction (MTP), projecting tokens into a smaller latent dimension for expert routing and computation. This improves accuracy per byte and delivers over 5x throughput compared to the previous Nemotron Super generation. Notably, it is the first model in the Nemotron 3 family pre-trained using NVFP4 quantization — meaning it learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update, not just at inference time.
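
For intuition, here is a minimal PyTorch sketch of the latent-routing idea only — not NVIDIA's implementation, and with invented dimensions, expert design, and top-k. The point it illustrates: hidden states are projected down to a smaller latent dimension, and both the router and the experts operate in that cheaper latent space before projecting back up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Illustrative latent-space MoE layer. All sizes are invented;
    the real LatentMoE also combines Mamba2 blocks and MTP, omitted here."""

    def __init__(self, d_model=4096, d_latent=1024, n_experts=8, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)      # project into latent space
        self.router = nn.Linear(d_latent, n_experts)  # route in latent space
        self.experts = nn.ModuleList(
            [nn.Linear(d_latent, d_latent) for _ in range(n_experts)]
        )
        self.up = nn.Linear(d_latent, d_model)        # project back to model dim
        self.top_k = top_k

    def forward(self, x):                  # x: (tokens, d_model)
        z = self.down(x)                   # (tokens, d_latent)
        weights, idx = self.router(z).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(z)
        for k in range(self.top_k):        # weighted sum of the chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(z[mask])
        return self.up(out)                # (tokens, d_model)
```

Because routing and expert computation happen at d_latent rather than d_model, the per-token FLOPs of the MoE layer shrink roughly with the compression ratio, which is the efficiency lever the architecture description above is pointing at.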

Nemotron 3 Super supports a native 1 million token context window and responds to queries by first generating a reasoning trace before concluding with a final response, making it purpose-built for long-running autonomous agents and high-volume workloads such as IT ticket automation.
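
In practice, the reasoning trace arrives in-band with the response. The sketch below assumes DeepInfra's OpenAI-compatible endpoint and that the trace is wrapped in <think>...</think> tags, as is common for reasoning models; the model identifier is illustrative, so verify both against the model card.

```python
import os
import re
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint; the model id below is
# illustrative, so check the provider's model list for the exact name.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

text = resp.choices[0].message.content
# Assumption: the reasoning trace is wrapped in <think>...</think> tags,
# a common convention for reasoning models; verify against the model card.
match = re.search(r"<think>(.*?)</think>(.*)", text, re.DOTALL)
reasoning, answer = (
    (match.group(1).strip(), match.group(2).strip()) if match else ("", text)
)
print(answer)
```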

| Specification | Details |
|---|---|
| Architecture | Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP) |
| Total Parameters | 120 billion |
| Active Parameters | 12 billion (per inference pass) |
| Context Window | Up to 1 million tokens |
| Training Data | 25 trillion tokens |
| Supported Languages | English, French, German, Italian, Japanese, Spanish, Chinese, plus 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |

NVIDIA Nemotron 3 Super 120B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

NVIDIA Nemotron 3 Super 120B API Review Summary

  • DeepInfra is the lowest-cost provider at $0.20 / 1M tokens (blended); the highest-priced options (Nebius and Lightning AI, at $0.45) charge 2.25x as much.
  • DeepInfra delivers strong throughput: 459.3 tokens/sec, within 8% of the fastest provider (Lightning AI at 498.6 t/s).
  • DeepInfra is competitive on interactivity: 1.01s TTFT (3rd best), behind Baseten (0.56s) and Weights & Biases (0.73s).
  • Provider performance varies widely: output speed ranges from 498.6 t/s to 144.9 t/s — a ~3.4x spread — so provider choice materially impacts UX and throughput.
  • API feature coverage is mixed: function calling is supported by 3 of 5 providers (Weights & Biases, DeepInfra, Nebius); JSON mode by 2 of 5 (Weights & Biases, Nebius).
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours using a 10,000 input-token workload.
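
As a small illustration of that methodology, the headline figures are simple medians over repeated runs; the sample values below are invented:

```python
import statistics

# Invented measurements from repeated benchmark runs; the real benchmark
# aggregates 72 hours of samples at ~10,000 input tokens per request.
throughput_samples = [455.1, 462.8, 448.9, 471.0, 459.3, 460.2, 457.7]  # t/s
ttft_samples = [0.98, 1.05, 1.01, 0.99, 1.12, 1.00, 1.01]               # seconds

print(f"P50 throughput: {statistics.median(throughput_samples):.1f} t/s")
print(f"P50 TTFT:       {statistics.median(ttft_samples):.2f} s")
```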

NVIDIA Nemotron 3 Super 120B — Best APIs

| Provider | Why Notable | Blended ($/1M) | Speed (t/s) | Latency (TTFT) | Context | Tools |
|---|---|---|---|---|---|---|
| DeepInfra | Best price + strong speed/latency balance; supports function calling | $0.20 | 459.3 | 1.01s | 262k | Yes |
| Baseten | Lowest latency (best TTFT) with near-top speed | $0.41 | 479.9 | 0.56s | 203k | No |
| Lightning AI | Fastest output speed (max throughput) | $0.45 | 498.6 | 1.46s | 256k | No |
| Nebius | High speed; supports JSON mode + function calling | $0.45 | 483.7 | 1.62s | 256k | Yes |
| Weights & Biases | Low latency; supports JSON mode + function calling; low throughput | $0.35 | 144.9 | 0.73s | 262k | Yes |

Quick Verdict: Which Nemotron 3 Super Provider is Best?

Based on benchmarks across 5 tracked providers, DeepInfra is the recommended API for production-scale Nemotron 3 Super deployment. At $0.20/1M tokens, it is 55% cheaper than the most expensive providers while delivering 459.3 t/s — within 8% of the fastest option. For the lowest latency, Baseten leads at 0.56s TTFT. For maximum raw throughput, Lightning AI leads at 498.6 t/s.

Overall Winner: DeepInfra

DeepInfra secures the top spot by dominating the economic efficiency of serving Nemotron 3 Super 120B, while maintaining highly competitive performance across every other metric.

  • Input Price: $0.10 / 1M tokens
  • Output Price: $0.50 / 1M tokens
  • Blended Price: $0.20 / 1M tokens (cheapest on the market)
  • Output Speed: 459.3 t/s (within 8% of the fastest provider)
  • Latency (TTFT): 1.01s (3rd best overall)
  • Context Window: 262k tokens
  • API Features: Function Calling supported

The cost delta of $0.25 per million tokens versus the highest-priced providers makes DeepInfra the only logical choice for production-scale deployments. For most RAG or chat applications, the difference between 498 t/s (Lightning AI) and 459 t/s (DeepInfra) is imperceptible, while the 55% cost advantage compounds significantly at volume.
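
A quick back-of-envelope makes the compounding concrete. The rates are the published blended prices; the monthly token volume and the 3:1 input:output blend convention are assumptions for illustration:

```python
# Back-of-envelope monthly cost using the blended $/1M rates above.
# Blended rate convention (assumed 3:1 input:output mix, a common benchmark
# blend): DeepInfra = (3 * $0.10 + 1 * $0.50) / 4 = $0.20 per 1M tokens.
blended_per_1m = {"DeepInfra": 0.20, "Weights & Biases": 0.35,
                  "Baseten": 0.41, "Lightning AI": 0.45, "Nebius": 0.45}
monthly_tokens = 2_000_000_000  # assumed 2B tokens/month workload

for provider, rate in sorted(blended_per_1m.items(), key=lambda kv: kv[1]):
    print(f"{provider:18s} ${monthly_tokens / 1e6 * rate:,.0f}/month")
# DeepInfra lands at $400/month vs. $900/month for the $0.45 providers.
```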

Best for Low Latency: Baseten

For applications requiring immediate feedback — such as voice-to-voice agents or highly responsive chat interfaces — Baseten is the technical leader.

  • Latency (TTFT): 0.56s (fastest in the benchmark)
  • Output Speed: 479.9 t/s (#3 overall)
  • Blended Price: $0.41 / 1M tokens
  • API Features: No JSON Mode or Function Calling

Baseten’s 0.56s TTFT beats the closest competitor by 0.17s and delivers a genuinely real-time feel for end-users. However, its pricing ($0.41/1M) is more than double DeepInfra’s, and it lacks support for JSON Mode and Function Calling — limiting its viability for complex agentic workflows.
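
TTFT is easy to verify yourself with a streaming request. Here is a minimal sketch against any OpenAI-compatible endpoint; the base URL, key, and model id are placeholders:

```python
import time
from openai import OpenAI

# Placeholder endpoint and key; point these at the provider under test.
client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # First non-empty content chunk marks time-to-first-token.
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```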

Best for Raw Throughput: Lightning AI

Lightning AI is purpose-built for generation speed, making it the natural choice for high-volume batch processing jobs.

  • Output Speed: 498.6 t/s (fastest in the benchmark)
  • Latency (TTFT): 1.46s (one of the slower starts)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: No JSON Mode or Function Calling

Lightning AI’s 498.6 t/s is the fastest measured, but the 8.5% speed advantage over DeepInfra does not justify the 125% price premium for most use cases. Combined with the lack of JSON Mode and Function Calling, it is best reserved for offline batch workloads where cost is not a constraint.

Feature-Rich Alternative: Nebius

Nebius occupies a specific niche for developers requiring both Function Calling and JSON Mode — the only provider besides Weights & Biases to support both.

  • Output Speed: 483.7 t/s (#2 overall)
  • Latency (TTFT): 1.62s (highest in the benchmark)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: JSON Mode + Function Calling

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling. It delivers solid throughput (483.7 t/s) but suffers from the highest latency in the benchmark (1.62s), making it unsuitable for real-time interfaces.
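
Both features use the standard OpenAI-compatible request fields, so switching providers is mostly a base-URL change. A hedged sketch (placeholder endpoint and key; the get_weather tool is hypothetical):

```python
from openai import OpenAI

# Placeholder endpoint; both features use standard OpenAI-compatible fields.
client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

# JSON mode: constrain the output to valid JSON.
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # illustrative model id
    messages=[{"role": "user", "content": "List three EU capitals as JSON."}],
    response_format={"type": "json_object"},
)

# Function calling: let the model request a tool invocation.
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(resp.choices[0].message.tool_calls)
```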

Developer Alternative: Weights & Biases

Weights & Biases presents an unusual performance profile, likely acting as a specialized developer-environment endpoint rather than a production inference backend.

  • Output Speed: 144.9 t/s (significantly slower than the rest of the field — ~3.4x below the leader)
  • Latency (TTFT): 0.73s (#2 lowest)
  • Blended Price: $0.35 / 1M tokens
  • API Features: JSON Mode + Function Calling

Despite strong latency and full feature support, its throughput bottleneck (144.9 t/s) makes it unsuitable for production traffic. It is best suited for short-context developer testing and evaluation environments.

Frequently Asked Questions

Which API provider is cheapest for NVIDIA Nemotron 3 Super?

DeepInfra is the cheapest provider at $0.20 blended per 1M tokens — roughly 55% cheaper than Nebius and Lightning AI, and 50% cheaper than Baseten.

Which provider has the fastest Time to First Token (TTFT)?

Baseten offers the fastest latency with a TTFT of 0.56s, making it ideal for real-time conversational applications.

Does DeepInfra support Function Calling for Nemotron 3 Super?

Yes, DeepInfra supports Function Calling, making it suitable for agentic workflows. Lightning AI and Baseten currently do not support this feature.

Is Nebius worth the extra cost?

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling with no tolerance for prompt-engineering workarounds.

What makes Nemotron 3 Super different from other reasoning models?

Nemotron 3 Super uses a unique hybrid Mamba2-Transformer LatentMoE architecture, enabling 120B total parameters with only 12B active per inference. This delivers over 5x throughput compared to the previous Nemotron Super, while supporting a native 1M-token context window for long-running autonomous agents.

Conclusion

For the vast majority of Nemotron 3 Super 120B deployments, DeepInfra is the recommended provider. It offers the market’s lowest price ($0.20/1M), strong throughput (459.3 t/s), viable latency (1.01s), and Function Calling support — all without the significant cost premium of the competition.

  • Choose DeepInfra for the best overall value — lowest cost, strong throughput, and function calling support.
  • Choose Baseten if your application is latency-critical and every millisecond of TTFT counts.
  • Choose Lightning AI for pure bulk text generation where speed is the sole metric and cost is not a constraint.
  • Choose Nebius if native JSON Mode and Function Calling are both non-negotiable requirements.