Introducing Nemotron 3 Super on DeepInfra
Published on 2026.03.11 by Aray Sultanbekova

Introducing NVIDIA Nemotron 3 Super on DeepInfra

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron 3 Super, the latest open model in the Nemotron family.

Nemotron 3 Super is purpose-built for complex multi-agent applications, delivering high reasoning accuracy, fast inference, and a 1M token context window—all while running efficiently on a single NVIDIA GPU. On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead.

With its hybrid MoE architecture and 120B total / 12B active parameters, Super is designed to run multiple collaborating agents per application efficiently. Paired with DeepInfra's high-efficiency inference platform and usage-based pricing, you can build and scale agentic workflows using only a few lines of code.

What makes Nemotron 3 Super different

Nemotron 3 Super uses a hybrid architecture that combines Mixture of Experts (MoE) with Mamba and transformer layers. This is the same architectural foundation as Nemotron 3 Nano, scaled and optimized for heavier agentic workloads.

Most layers rely on Mamba for high-throughput sequence processing, while transformer layers handle complex reasoning.

On top of this architecture, Latent MoE activates 4 experts for the inference cost of one, and Multi-Token Prediction (MTP) accelerates long-form generation by predicting multiple tokens per forward pass.

These design choices enable:

  • Up to 5x higher throughput compared to the previous Nemotron Super model
  • Up to 2x higher accuracy compared to the previous Nemotron Super model
  • Consistent performance across variable, multi-step workloads
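To make the sparse-activation idea behind MoE more concrete, here is a minimal toy sketch of generic top-k expert routing in plain Python. It is not Nemotron's Latent MoE and does not attempt to reproduce it. The experts, router scores, and function names are all hypothetical, chosen only to show that a router picks a few experts per input and that only those experts run.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weight_sum = sum(probs[i] for i in top)
    return {i: probs[i] / weight_sum for i in top}

def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and blend their outputs by router weight."""
    weights = route_top_k(router_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy setup: 8 "experts" that are simple scalar functions.
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
router_scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.0, 1.0, 0.4]  # made-up scores

selected = route_top_k(router_scores, k=4)
output = moe_forward(3.0, experts, router_scores, k=4)
print(sorted(selected))  # indices of the experts that actually ran
```

In this toy setup only 4 of the 8 experts are evaluated for the input, which is the general reason a model can have far more total parameters than active ones.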

The 1M-token context window is a core part of the model's design. For agentic systems, this means holding full conversation history, tool call traces, and plan state across long workflows — without losing coherence or truncating critical context mid-task.
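As a rough illustration of what this means in practice, the sketch below checks whether an agent's accumulated history still fits in the context window with room left for a reply. The 4-characters-per-token ratio is a crude assumption (real tokenizers vary), and the helper names and the output reserve are hypothetical, not part of any DeepInfra or NVIDIA API.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # context size stated in this post
CHARS_PER_TOKEN = 4  # rough heuristic, assumed; real tokenizers vary

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def history_tokens(messages):
    """Sum the estimated tokens across all message contents."""
    return sum(estimate_tokens(m["content"]) for m in messages)

def fits_in_context(messages, reserve_for_output=8_000):
    """Check whether the history leaves room for the model's reply."""
    return history_tokens(messages) + reserve_for_output <= CONTEXT_WINDOW_TOKENS

# Hypothetical agent state: system prompt, a plan, and a large tool trace.
messages = [
    {"role": "system", "content": "You are a planning agent."},
    {"role": "user", "content": "Reconcile the Q3 invoices."},
    {"role": "assistant", "content": "Plan: 1) load invoices 2) match payments 3) flag gaps."},
    {"role": "user", "content": "Tool result: " + "row;" * 50_000},
]

print(history_tokens(messages), fits_in_context(messages))
```

For an exact count you would use the model's own tokenizer instead of the character heuristic; the check itself stays the same.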

Nemotron 3 Super achieves leading scores on reasoning and agentic benchmarks including AIME 2025, IFBench, TerminalBench, and RULER. For more details on the model, check out the technical blog from NVIDIA.

Like the rest of the Nemotron family, Nemotron 3 Super is fully open: weights, training datasets, and development recipes are all publicly available. NVIDIA trained the model on high-quality synthetic data generated from frontier open reasoning models, giving teams full transparency to inspect, customize, and fine-tune the model for their specific needs.


Getting started on DeepInfra

Nemotron 3 Super is accessible via DeepInfra's OpenAI-compatible API. Here's how to get started in a few lines of code.

Install the client:

pip install openai

Run your first inference:

from openai import OpenAI

client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key findings from this medical billing report."},
    ],
)

print(response.choices[0].message.content)
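For multi-step or agentic use, the same call can be wrapped so that conversation history accumulates across turns. The sketch below is one possible way to do that, not an official DeepInfra pattern. The `chat_turn` helper and the `DEEPINFRA_API_KEY` environment variable name are assumptions made for illustration; the model ID and base URL are taken from the example above.

```python
import os

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"  # model ID from the example above

def chat_turn(client, history, user_text, model=MODEL):
    """Append the user message, call the API, and record the assistant reply."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

def main():
    # Imported here so the helper above has no hard dependency on the package.
    from openai import OpenAI

    client = OpenAI(
        # DEEPINFRA_API_KEY is an assumed variable name; use whatever your setup defines.
        api_key=os.environ["DEEPINFRA_API_KEY"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    print(chat_turn(client, history, "List three risks in a billing migration."))
    print(chat_turn(client, history, "Expand on the second risk."))
```

Calling `main()` with the key set runs two turns against the live API. Because `history` is mutated in place, the second request carries the first question and answer along with it. Reading the key from the environment also keeps it out of source control.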

Streaming is supported out of the box:

stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to optimize revenue cycle management."}
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks may carry no choices or no text delta; skip those.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Enterprise-Grade Security and Privacy

DeepInfra operates with a zero-retention policy: inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in the DeepInfra Trust Center.

Start Building

Visit the Nemotron 3 Super model page on DeepInfra to explore pricing and start inference instantly. You can check out our documentation to learn more about the broader model ecosystem and developer resources.

Have questions or need help?

Reach out to us at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.
