NVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Nano 30B A3B

NVIDIA Nemotron 3 Nano 30B A3B is a large language model trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks. It is part of the Nemotron 3 family — NVIDIA’s most efficient family of open models, built for agentic AI applications.

The model employs a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 layers, 23 MoE layers, and 6 Attention layers using grouped query attention (GQA). Each MoE layer includes 128 routed experts with 6 activated per token, plus shared experts activated on all tokens. This yields approximately 31.6 billion total parameters with only ~3.2–3.6 billion active parameters per forward pass — delivering the reasoning quality of a much larger model at the speed and cost profile of a lightweight architecture.

Trained on approximately 25 trillion tokens covering code, math, science, and general knowledge, the model supports multiple natural languages and 43 programming languages. Reasoning can be toggled via the chat template — in non-reasoning mode (as benchmarked here), the model returns direct answers without intermediate reasoning traces, trading a slight decrease in accuracy on harder prompts for faster response times. In an 8K input / 16K output configuration on a single H200 GPU, the model achieves 3.3x higher throughput than Qwen3-30B-A3B.

Key Capabilities

  • Math and coding excellence
  • Multi-step tool calling
  • Multi-turn agentic workflows
  • Instruction following
  • Structured outputs (JSON mode)
  • Reasoning ON/OFF modes with configurable thinking budgets

NVIDIA Nemotron 3 Nano 30B A3B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
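As a quick orientation before the numbers, here is a minimal sketch of how a request to the model might look through DeepInfra's OpenAI-compatible chat completions endpoint. The endpoint path and model identifier below are assumptions for illustration — check the model page for the exact ID before use.

```python
import json

# Assumed endpoint and model ID -- verify against DeepInfra's model page.
DEEPINFRA_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL_ID = "nvidia/Nemotron-3-Nano-30B-A3B"

def build_chat_request(prompt: str, max_tokens: int = 500) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # stream tokens to minimize perceived latency
    }

payload = build_chat_request("Write a Python function that reverses a string.")
print(json.dumps(payload, indent=2))
# Sending the request requires an API key, e.g.:
# requests.post(DEEPINFRA_URL, json=payload,
#               headers={"Authorization": "Bearer <DEEPINFRA_TOKEN>"})
```

Setting `stream` lets the application start rendering output as soon as the first token arrives, which is what makes the TTFT figures below directly user-visible.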

NVIDIA Nemotron 3 Nano 30B A3B API Review Summary

  • DeepInfra is the only benchmarked provider for Nemotron 3 Nano 30B A3B, and therefore sets the reference across all key metrics: speed, latency, and price.
  • Output speed: 93.7 tokens/sec (P50 median over the past 72 hours, 10,000 input token workload).
  • Low interactive latency: 0.45s TTFT — sub-half-second initial response.
  • Cost profile: $0.09 blended / 1M tokens (3:1 input:output blend).
  • Token pricing: $0.05 / 1M input tokens and $0.20 / 1M output tokens.
  • End-to-end response time: 5.78s to output 500 tokens.
  • Context window: 262k tokens.
  • Full API feature support: Both JSON Mode and Function Calling are supported.

Quick Summary of DeepInfra

DeepInfra is the only API provider for Nemotron 3 Nano 30B A3B deployment. It delivers a 93.7 t/s output speed, a 0.45s TTFT, and a blended price of $0.09/1M tokens. Unlike the larger Nemotron 3 Super, this model also supports JSON Mode in addition to Function Calling, making it a strong fit for structured output workflows and agentic pipelines alike.

Latency: 0.45s Time to First Token

For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.45 seconds on a 10,000 input token workload — a figure that covers both prompt processing and generation of the first response token.

A sub-half-second TTFT effectively eliminates cold start delays for real-time applications, making it well suited for coding assistants, multi-turn agentic workflows, and any application requiring immediate perceived responsiveness.
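Client-side, TTFT is simply the elapsed time from dispatching the request to receiving the first streamed token. The sketch below times any token iterator; the simulated stream stands in for an actual streaming API response, which is assumed and out of scope here.

```python
import time
from typing import Iterable, Tuple

def time_to_first_token(stream: Iterable[str]) -> Tuple[float, list]:
    """Measure seconds until the first item of a token stream arrives,
    then drain the rest of the stream into a list."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)  # blocks until the first token is generated
    ttft = time.perf_counter() - start
    return ttft, [first, *it]

# Simulated stream standing in for a streamed API response:
def fake_stream():
    time.sleep(0.05)  # pretend server-side time to first token
    yield "Hello"
    yield ","
    yield " world"

ttft, tokens = time_to_first_token(fake_stream())
print(f"TTFT: {ttft:.3f}s, {len(tokens)} tokens")
```

Measuring this way in your own client is the most reliable check that your network path and payload sizes reproduce the benchmarked 0.45s figure.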

Output Speed: 93.7 Tokens per Second

DeepInfra achieves 93.7 tokens per second — a sustained P50 measurement over a 72-hour period. For a model with only ~3.2–3.6 billion active parameters per forward pass, this represents exceptional throughput efficiency.

At 93.7 t/s, the model can generate detailed code completions, multi-step reasoning traces, and long-form responses rapidly. Combined with the 0.45s TTFT, it delivers both fast starts and sustained generation speed across the full response.

End-to-End Response Time: 5.78 Seconds for 500 Tokens

DeepInfra completes a full 500-token output in 5.78 seconds — composed of the 0.45s TTFT plus generation time. This benchmark uses the model's non-reasoning mode, which omits intermediate thinking traces and delivers direct answers, keeping E2E times low.

This predictable and stable E2E latency makes it well suited for multi-step agentic workflows where consistent response times are important for downstream task orchestration.
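The 5.78s figure is internally consistent with the other two numbers: TTFT plus 500 tokens decoded at the sustained rate lands on the same value to within rounding.

```python
ttft_s = 0.45          # time to first token (seconds)
rate_tps = 93.7        # sustained output speed (tokens/sec)
output_tokens = 500

decode_s = output_tokens / rate_tps   # ~5.34s of pure generation
e2e_s = ttft_s + decode_s             # ~5.79s end to end

print(f"decode: {decode_s:.2f}s, end-to-end: {e2e_s:.2f}s")
```

The same arithmetic lets you estimate E2E latency for any target output length: for example, a 200-token reply would finish in roughly 0.45 + 200/93.7 ≈ 2.6 seconds.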

Cost Efficiency: $0.09 Blended Price per 1M Tokens

DeepInfra offers highly competitive pricing for Nemotron 3 Nano 30B A3B inference:

  • Input Price: $0.05 per 1M tokens
  • Output Price: $0.20 per 1M tokens
  • Blended Price: $0.09 per 1M tokens (3:1 input:output ratio)

At $0.09 blended per million tokens, Nemotron 3 Nano is one of the most cost-effective options available for an agentic-capable model with full JSON mode and function calling support. The low input pricing ($0.05/1M) makes it particularly economical for RAG architectures and long-context workflows.
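The blended figure follows directly from the per-token prices: a 3:1 input:output blend averages $0.0875 per million tokens, reported as $0.09. A quick check, plus a hypothetical monthly-cost example:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Blended $/1M tokens for a given input:output token ratio."""
    total = input_ratio * input_per_m + output_ratio * output_per_m
    return total / (input_ratio + output_ratio)

price = blended_price(0.05, 0.20)     # 3:1 blend
print(f"${price:.4f} per 1M tokens")  # $0.0875, reported as $0.09

# Hypothetical monthly workload: 600M input + 200M output tokens
monthly = 600 * 0.05 + 200 * 0.20
print(f"${monthly:.2f}/month")
```

At these rates the example workload of 800M total tokens per month comes to $70 — illustrating why the low input price dominates costs for input-heavy patterns like RAG.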

Context Window and API Features

DeepInfra’s deployment supports a 262k token context window alongside both JSON Mode and Function Calling — offering complete API feature parity for production agentic applications. Unlike the larger Nemotron 3 Super 120B, which supports Function Calling only, this model adds native JSON Mode support, enabling deterministic structured outputs without additional prompt engineering overhead.

The 262k context window supports extensive document analysis, long conversation histories, and large codebase processing in a single API request. The model’s hybrid Mamba-Transformer architecture is specifically designed to maintain strong long-context fidelity while keeping latency low.
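In practice, JSON Mode is requested through the `response_format` field of the OpenAI-compatible API that DeepInfra exposes. The sketch below builds such a payload; the model identifier is an assumption, and the example response stands in for an actual API reply.

```python
import json

payload = {
    "model": "nvidia/Nemotron-3-Nano-30B-A3B",  # assumed model ID
    "messages": [
        {"role": "system",
         "content": "Extract fields as JSON with keys 'name' and 'city'."},
        {"role": "user", "content": "Ada Lovelace lived in London."},
    ],
    # Constrains the model to emit syntactically valid JSON:
    "response_format": {"type": "json_object"},
}

# A well-formed response body parses cleanly with no post-processing:
example_response = '{"name": "Ada Lovelace", "city": "London"}'
parsed = json.loads(example_response)
print(parsed["city"])
```

Because the output is guaranteed-parseable JSON, downstream code can consume it directly rather than retrying on malformed responses — the main practical win over prompt-engineered "please answer in JSON" approaches.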

Conclusion

For developers deploying NVIDIA Nemotron 3 Nano 30B A3B, DeepInfra is the way to go. It combines a sub-half-second TTFT (0.45s), solid throughput (93.7 t/s), a blended price of $0.09 per million tokens, and full support for both JSON Mode and Function Calling — making it a compelling, cost-effective foundation for agentic AI applications, coding assistants, and structured output pipelines.
