
NVIDIA Nemotron 3 Super is a 120-billion-parameter hybrid Mixture-of-Experts (MoE) model designed to combine compute efficiency with high accuracy. It targets multi-agent applications, specialized agentic systems, and complex reasoning tasks. Because its architecture activates only 12 billion parameters per token, the model delivers the quality of a much larger LLM at the latency needed for real-time, collaborative AI workflows.
NVIDIA has optimized the Nemotron 3 Super (specifically the 120B-A12B variant) to run numerous collaborating agents simultaneously on a single GPU. This is achieved through a Latent Mixture-of-Experts (LatentMoE) framework, which projects tokens into a smaller latent dimension for routing, significantly reducing compute overhead.
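To make the idea of routing in a smaller latent dimension concrete, here is a toy sketch in plain Python. It is not NVIDIA's implementation: the dimensions, the random weights, and the names `down_proj`, `router`, and `route_in_latent_space` are all hypothetical, chosen only to show a token being projected down before the router scores the experts.

```python
import random

# Hypothetical sizes for illustration; the real model's dimensions differ.
HIDDEN_DIM = 16   # width of the token representation
LATENT_DIM = 4    # smaller width used for routing
NUM_EXPERTS = 8
TOP_K = 2         # experts selected per token


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def route_in_latent_space(token, down_proj, router, top_k):
    """Project the token to the latent space, then pick the top-k experts."""
    latent = matvec(down_proj, token)   # HIDDEN_DIM -> LATENT_DIM
    scores = matvec(router, latent)     # one score per expert
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return latent, ranked[:top_k]


rng = random.Random(0)
down_proj = random_matrix(LATENT_DIM, HIDDEN_DIM, rng)
router = random_matrix(NUM_EXPERTS, LATENT_DIM, rng)
token = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]

latent, chosen = route_in_latent_space(token, down_proj, router, TOP_K)
print(f"latent width: {len(latent)}, selected experts: {chosen}")
```

In this sketch the router weight matrix has `NUM_EXPERTS x LATENT_DIM` entries rather than `NUM_EXPERTS x HIDDEN_DIM`, which is the kind of saving the paragraph above describes.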
Nemotron 3 Super is strongest in agentic workflows, scientific reasoning, and autonomous software engineering. It is competitive with peer models in its parameter class: in the comparison below it leads on HMMT Feb25 and RULER at 1M context, and trails Qwen3.5-122B-A10B on MMLU-Pro and SWE-Bench.
| Benchmark Category | Benchmark Name | Score | Metric |
|---|---|---|---|
| General Knowledge | MMLU-Pro | 83.73 | Accuracy (%) |
| Reasoning | AIME25 (No Tools) | 90.21 | Accuracy (%) |
| Reasoning | HMMT Feb25 (No Tools) | 93.67 | Accuracy (%) |
| Coding | LiveCodeBench (v5) | 81.19 | Pass@1 (%) |
| Human Preference | Arena-Hard-V2 | 73.88 | Score |

Comparison with peer models:

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| HMMT Feb25 | 93.67 | 91.40 | 90.00 |
| RULER @ 1M Context | 91.75 | 91.33 | 22.30 |
| SWE-Bench | 60.47 | 66.40 | 41.90 |
Nemotron 3 Super is available for public deployment via the DeepInfra inference cloud. DeepInfra provides an OpenAI-compatible endpoint, making it easy for developers to integrate the model into existing applications.
Access requires an API key obtained from your DeepInfra dashboard. Include this key in the Authorization header of your requests:

    Authorization: Bearer YOUR_DEEPINFRA_API_KEY

Python Example

    import os

    import requests

    # Read the key from the environment; the placeholder is only a fallback.
    DEEPINFRA_API_KEY = os.getenv("DEEPINFRA_API_KEY", "YOUR_DEEPINFRA_API_KEY")
    url = "https://api.deepinfra.com/v1/openai/chat/completions"
    headers = {
        "Authorization": f"Bearer {DEEPINFRA_API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Explain the concept of quantum entanglement in simple terms."},
        ],
        # OpenAI-style chat completions cap the reply length with "max_tokens".
        "max_tokens": 150,
        "temperature": 0.7,
    }
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # raise on HTTP errors instead of printing an error body
    print(response.json())

cURL Example

    curl -X POST \
      https://api.deepinfra.com/v1/openai/chat/completions \
      -H "Authorization: Bearer YOUR_DEEPINFRA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
        "messages": [
          {"role": "user", "content": "Explain quantum entanglement."}
        ],
        "max_tokens": 150
      }'

DeepInfra bills Nemotron 3 Super on a usage basis, so developers can scale from prototyping to enterprise production without large upfront costs.
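With usage-based billing, the cost of a request can be estimated from the token counts that OpenAI-style chat completion responses report in their `usage` object. The sketch below assumes that response shape; the helper `estimate_cost_usd` is hypothetical, and the per-million-token prices are placeholders, not DeepInfra's actual rates.

```python
def estimate_cost_usd(usage, input_price_per_million, output_price_per_million):
    """Estimate the cost of one request from its reported token usage."""
    input_cost = usage["prompt_tokens"] / 1_000_000 * input_price_per_million
    output_cost = usage["completion_tokens"] / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Placeholder prices in USD per million tokens; NOT real DeepInfra rates.
EXAMPLE_INPUT_PRICE = 0.10
EXAMPLE_OUTPUT_PRICE = 0.50

# Example of the `usage` object found in an OpenAI-style response body.
sample_usage = {"prompt_tokens": 1200, "completion_tokens": 150, "total_tokens": 1350}

cost = estimate_cost_usd(sample_usage, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.6f}")
```

In practice, pass `response.json()["usage"]` from a real call together with the current prices listed on the DeepInfra model page.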
For users looking to deploy the model on private infrastructure, the following hardware configurations are recommended:
* Minimum: 8× NVIDIA H100-80GB GPUs.
* Optimized: Fully compatible with NVIDIA Grace Blackwell (GB200) systems. On B200/B300 hardware, the BF16 checkpoint can fit on as few as 2 GPUs due to increased HBM capacity.
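A rough back-of-the-envelope check helps explain these GPU counts. The sketch below computes only the memory needed to hold the BF16 weights (2 bytes per parameter) and the smallest number of GPUs that could store them. The 192 GB per-GPU figure for B200 is an assumption, and the helper names are hypothetical.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb, hbm_gb_per_gpu):
    """Smallest GPU count whose combined HBM can hold the weights."""
    return math.ceil(weights_gb / hbm_gb_per_gpu)


TOTAL_PARAMS = 120e9   # 120B total parameters
BF16_BYTES = 2         # BF16 stores each parameter in 2 bytes

weights_gb = weight_memory_gb(TOTAL_PARAMS, BF16_BYTES)
h100_gpus = min_gpus_for_weights(weights_gb, 80)    # H100-80GB
b200_gpus = min_gpus_for_weights(weights_gb, 192)   # assumed B200 HBM capacity

print(f"BF16 weights: {weights_gb:.0f} GB")
print(f"Weights-only lower bound: {h100_gpus} x H100-80GB, {b200_gpus} x B200")
```

This is a lower bound for weights only. Real deployments also need memory for the KV cache and activations, and typically use GPU counts that suit tensor parallelism, which is presumably why the recommended H100 configuration is larger than this estimate.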
NVIDIA Nemotron 3 Super is a significant step for Mixture-of-Experts models. By pairing 120B total parameters with only 12B active parameters per token, it offers enterprise-grade reasoning and multi-agent collaboration at a fraction of the compute cost of a dense model of comparable size. Whether you are building autonomous software agents or processing million-token documents, Nemotron 3 Super provides the accuracy and efficiency these workloads require.
For the latest updates and community milestones, visit the official NVIDIA news section or the DeepInfra blog.
© 2026 DeepInfra. All rights reserved.