

NVIDIA Nemotron 3 Super on DeepInfra: 120B MoE Model
Published on 2026.05.25 by DeepInfra

NVIDIA’s Nemotron 3 Super has 120 billion parameters but activates only 12 billion per token, a ratio that keeps per-token compute low when orchestrating multiple agents in parallel. It’s built on a novel architecture called LatentMoE, a hybrid of Mamba-2, Mixture-of-Experts, and Attention layers designed from the ground up for agentic, reasoning, and long-context workloads. The model supports a context window of up to 1 million tokens, and benchmark results back that figure up.

On the RULER benchmark at 1 million tokens, Nemotron 3 Super scores 91.75, compared to 22.30 for GPT-OSS-120B at the same length, which suggests the architecture retains information far better at extreme context lengths. Beyond long-context, it ships with a configurable reasoning mode that can be toggled on or off at inference time, making it practical across both deep reasoning tasks and lighter conversational use. It was pre-trained on over 25 trillion tokens spanning code, math, science, and general knowledge, post-trained with multi-stage reinforcement learning, and released under NVIDIA’s open commercial license. You can find the full model card and endpoint details on the NVIDIA Nemotron 3 Super model page.

What Makes This Model Different

LatentMoE: a hybrid architecture worth understanding. Nemotron 3 Super uses an architecture NVIDIA calls LatentMoE — a combination of Mamba-2 state space layers, Mixture-of-Experts routing, and standard Attention, augmented with Multi-Token Prediction (MTP) heads. The MoE routing happens in a projected latent dimension rather than full model dimension, which improves compute efficiency per token. The MTP heads use shared weights across prediction steps, enabling native speculative decoding without a separate draft model — which has real implications for inference throughput. For latency and throughput numbers across configurations, see the Nemotron 3 Super API benchmarks.
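To make the idea of routing in a projected latent dimension more concrete, here is a toy sketch in plain Python. It is not NVIDIA's implementation: the class name `ToyLatentMoE`, the layer sizes, the random weights, and the exact placement of the down and up projections are all assumptions chosen for illustration. It only shows the general pattern of projecting a token down, picking the top-k experts in the smaller space, and projecting the mixed result back up.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights, scaled by the input width."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyLatentMoE:
    """Hypothetical toy layer: routing and experts both live in a latent space."""

    def __init__(self, d_model=16, d_latent=4, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = random_matrix(d_latent, d_model, rng)       # d_model -> d_latent
        self.router = random_matrix(n_experts, d_latent, rng)   # latent -> expert scores
        self.experts = [random_matrix(d_latent, d_latent, rng) for _ in range(n_experts)]
        self.up = random_matrix(d_model, d_latent, rng)         # d_latent -> d_model

    def forward(self, token):
        latent = matvec(self.down, token)
        scores = matvec(self.router, latent)
        # Keep only the top-k experts for this token.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.top_k]
        weights = softmax([scores[i] for i in chosen])
        mixed = [0.0] * len(latent)
        for weight, idx in zip(weights, chosen):
            expert_out = matvec(self.experts[idx], latent)
            mixed = [m + weight * o for m, o in zip(mixed, expert_out)]
        return matvec(self.up, mixed), chosen


layer = ToyLatentMoE()
output, chosen_experts = layer.forward([1.0] * 16)
print(len(output), chosen_experts)
```

In this toy setup each expert matrix is `d_latent x d_latent` rather than `d_model x d_model`, which is one way a latent projection can reduce per-token compute.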

120B total parameters, 12B active. At inference time, only 12B parameters are activated per token — keeping compute costs close to a 12B dense model while retaining the capacity of a 120B one. It was also the first model in the Nemotron 3 family pre-trained using NVFP4 precision, then released as a BF16 checkpoint. Pre-training covered 25 trillion tokens across code, math, science, and general knowledge, spanning 20 languages and 43 programming languages. If you want to understand where Nemotron 3 Super sits relative to the smaller end of the family, the Nemotron 3 Nano explainer covers the tradeoffs between model sizes well.
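A quick back-of-the-envelope calculation shows what the 120B/12B split means in practice. The sketch below assumes the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass (which ignores attention and other overheads) and 2 bytes per parameter for BF16 weights. The function names are illustrative, and the results are rough estimates rather than measured figures.

```python
TOTAL_PARAMS = 120e9   # total parameters, from the article
ACTIVE_PARAMS = 12e9   # parameters activated per token, from the article
BF16_BYTES_PER_PARAM = 2


def active_fraction(total_params, active_params):
    """Share of the model's weights that participate in each token."""
    return active_params / total_params


def approx_forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2 * active_params


def weight_memory_gb(total_params, bytes_per_param=BF16_BYTES_PER_PARAM):
    """Memory for the weights alone; all experts must be resident, not just active ones."""
    return total_params * bytes_per_param / 1e9


print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
print(f"approx forward GFLOPs per token: {approx_forward_flops_per_token(ACTIVE_PARAMS) / 1e9:.0f}")
print(f"BF16 weight memory: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")
```

Compute scales with the active parameters, while weight memory scales with the total, which is why a sparse model can be cheap per token yet still need multi-GPU serving.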

Long-context retrieval holds up at 1M tokens. The default context window is 256k (limited by VRAM), but the model supports up to 1M tokens. RULER benchmark scores show strong retention across the range:

| Context Length | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| RULER @ 256k | 96.30 | 96.74 | 52.30 |
| RULER @ 512k | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |

At 1M tokens, Nemotron 3 Super edges past Qwen3.5-122B-A10B — and GPT-OSS-120B essentially falls apart at that range.
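One way to read the RULER numbers is to look at how much each model loses between 256k and 1M tokens. The short sketch below computes that drop from the scores reported above. The helper name `score_drop` is ours, and the only data used is what the table lists.

```python
# RULER scores as reported in the table above.
RULER_SCORES = {
    "Nemotron 3 Super": {"256k": 96.30, "512k": 95.67, "1M": 91.75},
    "Qwen3.5-122B-A10B": {"256k": 96.74, "512k": 95.95, "1M": 91.33},
    "GPT-OSS-120B": {"256k": 52.30, "512k": 46.70, "1M": 22.30},
}


def score_drop(scores, start="256k", end="1M"):
    """Points lost between two context lengths (positive means degradation)."""
    return scores[start] - scores[end]


for model, scores in RULER_SCORES.items():
    print(f"{model}: -{score_drop(scores):.2f} points from 256k to 1M")
```

Nemotron 3 Super loses about 4.55 points across that range, Qwen3.5-122B-A10B about 5.41, and GPT-OSS-120B about 30.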

Math, code, and science benchmarks: a mixed but competitive picture. The model leads on several science and math benchmarks (HMMT Feb25, SciCode), stays competitive on coding (LiveCodeBench v5: 81.19, ahead of Qwen3.5-122B-A10B but behind GPT-OSS-120B), and shows strength in multilingual software engineering (SWE-Bench Multilingual via OpenHands: 45.78 vs. GPT-OSS-120B’s 30.80). It trails on GPQA and HLE without tools, though tool-augmented scores close some of that gap. For a direct head-to-head on coding and reasoning tasks, the Nemotron 3 Nano vs GPT-OSS-20B comparison offers useful context on how the Nemotron family generally holds up against OpenAI-class models. Selected scores:

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| GPQA (with tools) | 82.70 | - | 80.09 |
| LiveCodeBench v5 | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| SWE-Bench Multilingual | 45.78 | - | 30.80 |

Configurable reasoning and agentic-first design. The model’s thinking behavior can be toggled per request via chat template parameters (enable_thinking=True/False), with an additional low_effort mode for lighter reasoning tasks. Post-training went through three explicit stages: SFT, then RL via asynchronous GRPO across math, code, science, tool use, and multi-turn conversation, followed by RLHF for conversational quality. It’s compatible with vLLM, SGLang, TRT-LLM, and OpenCode, making it straightforward to drop into existing agent scaffolds. Hardware baseline is 8× H100-80GB, dropping to 2× B200/B300 GPUs due to higher HBM capacity on Blackwell.
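As a rough illustration of the per-request toggle, the sketch below builds the keyword arguments for an OpenAI-SDK chat completion call with thinking switched on or off. It assumes the serving endpoint forwards a `chat_template_kwargs` field to the chat template, which is how vLLM's OpenAI-compatible server handles it. Whether a given hosted endpoint accepts the same field is an assumption to verify against its API reference. The helper name `build_chat_request` is hypothetical, and the `low_effort` mode is left out because its exact parameter name is not given here.

```python
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"


def build_chat_request(prompt, enable_thinking=True):
    """Return kwargs for client.chat.completions.create(**kwargs)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes chat_template_kwargs through to the
        # model's chat template (vLLM-style). extra_body is the OpenAI SDK's
        # way of sending fields that are not part of the standard schema.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


fast_request = build_chat_request("Summarize this changelog in one line.", enable_thinking=False)
deep_request = build_chat_request("Prove that the sum of two odd numbers is even.")
print(fast_request["extra_body"], deep_request["extra_body"])
```

With a configured client, the request would be sent as `client.chat.completions.create(**fast_request)`.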

Getting Started on DeepInfra

Nemotron 3 Super is available on DeepInfra as a public endpoint under the model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B. Pricing is $0.10 per 1M input tokens and $0.50 per 1M output tokens — usage-based, no commitments. For a full breakdown of how that compares across the Nemotron family, the NVIDIA Nemotron API pricing guide is worth a read. If you need a dedicated setup, private endpoint deployment is also available through the DeepInfra dashboard.
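For budgeting, the per-token prices above translate into a simple estimate. The sketch below uses only the two listed rates. The function name and the example token counts are illustrative, and it ignores anything not stated here, such as caching discounts or extra reasoning tokens billed as output.

```python
INPUT_USD_PER_MILLION = 0.10   # listed price per 1M input tokens
OUTPUT_USD_PER_MILLION = 0.50  # listed price per 1M output tokens


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its token counts."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# Example: a 200k-token document with a 2k-token answer.
print(f"${estimate_cost_usd(200_000, 2_000):.4f}")
```

For that example the input side dominates: about $0.02 for the document and $0.001 for the answer.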

The API is OpenAI-compatible — swap your base URL, point to the model, and your existing SDK code works as-is. DeepInfra operates with a zero-retention policy and holds both SOC 2 and ISO 27001 certifications. The API reference for Nemotron 3 Super covers authentication, request parameters, and response schema.

Here’s a minimal example to get your first completion:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
      "messages": [
        {
          "role": "user",
          "content": "Explain the difference between MoE and dense transformer architectures."
        }
      ]
    }'
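The same request using the OpenAI Python SDK:

import os

from openai import OpenAI


# Read the API key from the environment rather than hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)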


response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between MoE and dense transformer architectures."
        }
    ],
)
print(response.choices[0].message.content)
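And with the OpenAI Node.js SDK:

import OpenAI from "openai";


// Read the API key from the environment rather than hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});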


const response = await openai.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
  messages: [
    {
      role: "user",
      content: "Explain the difference between MoE and dense transformer architectures.",
    },
  ],
});
console.log(response.choices[0].message.content);

The model supports tool/function calling on the same endpoint. If you want to explore the broader set of models available — including other members of the Nemotron family — the DeepInfra models page is a good starting point. To grab your API key and get started, head to the Nemotron 3 Super model page.
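A tool-calling request on an OpenAI-compatible endpoint typically adds a `tools` list to the same chat completion call. The sketch below shows the request shape and a small local dispatcher, without making a network call. The `get_weather` tool, its schema, and the helper names are hypothetical examples, not part of the model or the DeepInfra API.

```python
import json

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Tool description in the OpenAI function-calling schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

LOCAL_TOOLS = {"get_weather": get_weather}


def build_tool_request(prompt):
    """Return kwargs for client.chat.completions.create(**kwargs)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
    }


def run_tool_call(name, arguments_json):
    """Execute a tool call; the model supplies arguments as a JSON string."""
    arguments = json.loads(arguments_json)
    return LOCAL_TOOLS[name](**arguments)


print(run_tool_call("get_weather", '{"city": "Lisbon"}'))
```

With a live response, the name and arguments would come from `response.choices[0].message.tool_calls[0].function`, and the tool result would be sent back in a follow-up message with role `tool`.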

Conclusion

Nemotron 3 Super is a 120B-parameter model that runs at 12B active cost, holds together at 1M context lengths where competitors don’t, and ships with agentic scaffolding hooks that have historically required separate tooling to wire up. That combination of long-context reliability, configurable reasoning, and native tool use makes it a practical foundation for multi-agent pipelines, complex document workflows, and inference-time compute budgeting at scale.

If you’re building systems where context depth and per-token cost need to coexist without compromise, it’s worth evaluating. The Nemotron 3 Super release post has additional background on design decisions and intended use cases. To get started, visit the model page on DeepInfra.
