Best API Providers for NVIDIA Nemotron 3 Super 120B

We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Published on 2026.05.25 by DeepInfra

Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed APIs to dedicated GPU deployments and no-code routing layers. For a detailed cost breakdown, see the Nemotron 3 Super pricing guide.

Summary of Top Providers by Use Case

Best For	Provider
Best overall value & cost	DeepInfra
Best for interactive applications	CoreWeave
Best for latency-critical & voice agents	Baseten
Best for high-volume batch processing	Lightning AI
Best for complex agentic workflows	Nebius
Best for AWS enterprise integration	Amazon Bedrock
Best for flexible deployment options	Qubrid AI
Best for asynchronous workloads	Doubleword
Best for high availability with routing fallback	OpenRouter

Detailed Provider Reviews

DeepInfra

DeepInfra is the recommended option for most production Nemotron 3 Super deployments. It delivers the lowest blended price in the benchmarked set at $0.20 per 1M tokens, with strong output speed (459.3 t/s), competitive TTFT (1.01s), and full support for function calling. The platform runs on bare-metal infrastructure, is typically 50–80% cheaper than major cloud alternatives, and is SOC 2 and ISO 27001 certified. Public and private endpoint deployment are both available.

Key features:

Lowest blended price at $0.20/1M tokens; $0.10/1M input, $0.50/1M output
459.3 t/s output speed
1.01s TTFT
Function calling and JSON mode supported
262k context window
Public and private endpoints; SOC 2 and ISO 27001 certified

For a full breakdown of workload cost scenarios on DeepInfra, see the Nemotron 3 Super pricing guide.

CoreWeave

CoreWeave is highlighted in Artificial Analysis benchmarks for offering competitive sub-second TTFT and low blended pricing. It is a strong fit for real-time inference and cost-sensitive workloads where rapid first response matters.

Key features:

$0.26/1M tokens blended price
0.98s TTFT (fastest in the Artificial Analysis benchmark set)
154.4 t/s output speed
Function calling and JSON mode supported
262k context window

Baseten

Baseten is purpose-built for latency-critical applications. Its 0.56s TTFT is the fastest measured across benchmarked providers — a meaningful advantage for voice-to-voice agents or any interface where perceived responsiveness depends on getting a first response quickly.

Key features:

0.56s TTFT (fastest across all benchmarked providers)
479.9 t/s output speed
$0.41/1M tokens blended price
203k context window

Lightning AI

Lightning AI leads the benchmarked set on raw output speed at 509.3 t/s — the right choice when sustained generation throughput is the primary constraint, such as high-volume batch processing or document generation pipelines.

Key features:

509.3 t/s output speed (fastest in the set)
JSON mode supported
256k context window
$0.39–0.45/1M tokens blended price depending on benchmark source

Nebius

Nebius provides full support for both JSON mode and function calling at high output speeds, making it a solid fit for developers building structured, multi-step agentic workflows that require reliable tool orchestration.

Key features:

JSON mode and function calling both supported
Up to 483.7 t/s output speed
256k context window
$0.36–0.45/1M tokens blended price

Amazon Bedrock

Amazon Bedrock added Nemotron 3 Super on March 18, 2026, providing fully managed access through a single AWS API — no infrastructure to provision. It is the natural choice for enterprise teams already operating within the AWS ecosystem who need compliance, cross-region routing, and flexible service tiers.

Key features:

Access via bedrock-runtime and bedrock-mantle endpoints
Client-side and server-side tool calling supported
Standard, Priority, Flex, and Reserved service tiers
Cross-region routing (Geo and Global Cross-Region)
256k context window, up to 32k output tokens

Qubrid AI

Qubrid AI offers a range of deployment options from simple serverless API access to dedicated GPU VMs and Kubernetes deployments, bridging the gap between managed inference and custom infrastructure.

Key features:

Serverless API at $0.10/1M input, $0.50/1M output tokens
Dedicated cloud GPU VMs from $1.25/GPU/hr
Official Docker images for containerized deployments
Production-grade Kubernetes manifests and Helm charts
SDKs for Python, JavaScript, Go, and Java

Doubleword

Doubleword focuses on workload flexibility with distinct pricing tiers and a batch processing API for asynchronous inference — useful for teams that want to optimize cost by decoupling generation from real-time latency requirements.

Key features:

Standard, Async, and Realtime pricing tiers
Batch processing API for asynchronous workloads
OpenAI-compatible endpoints
256k context window

OpenRouter

OpenRouter is a unified API routing layer that provides access to Nemotron 3 Super through automatic provider routing and fallback mechanisms. It also offers a free variant (nvidia/nemotron-3-super-120b-a12b:free) with a 1M context window, useful for non-production testing. Current pricing on the paid tier: $0.10/1M input, $0.50/1M output.

Key features:

Unified OpenAI-compatible API with automatic provider routing
Fallback mechanisms to maximize uptime
$0.10/1M input, $0.50/1M output on paid tier
Free variant available with 1M token context window
1M token context window (paid tier)

Conclusion

Provider choice for Nemotron 3 Super depends on what your workload actually optimizes for:

Production deployments at scale: DeepInfra — lowest blended cost, full function calling, private endpoints
Interactive and latency-critical apps: Baseten (0.56s TTFT) or CoreWeave (0.98s TTFT, lowest blended in AA benchmark set)
High-volume batch processing: Lightning AI — 509.3 t/s output speed
Complex agentic workflows needing JSON + function calling: Nebius
AWS enterprise integration: Amazon Bedrock — fully managed, compliant, cross-region
Flexible self-hosted or dedicated GPU: Qubrid AI
Async batch workloads: Doubleword
High availability with routing fallback: OpenRouter

For most production-scale deployments, DeepInfra is the strongest starting point: lowest blended price, full API feature support, and the infrastructure reliability that comes with bare-metal deployment. The API benchmarks for Nemotron 3 Super and the Nemotron 3 Nano explainer are useful companion reads when evaluating the full Nemotron family.

How to deploy google/flan-ul2 - simple. (open source ChatGPT alternative)Flan-UL2 is probably the best open source model available right now for chatbots. In this post we will show you how to get started with it very easily. Flan-UL2 is large - 20B parameters. It is fine tuned version of the UL2 model using Flan dataset. Because this is quite a large model it is not eas...

GLM-5.1 on DeepInfra: Z.AI’s Agentic Engineering Model<p>Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after […]</p>

MiMo-V2.5 Provider Pricing and Deployment Guide<p>MiMo-V2.5 is worth paying attention to because it puts three things developers usually have to trade off into the same conversation: open weights, a 1 million-token model design, and pricing that can be unusually low depending on where you buy it. On Xiaomi’s first-party API, Artificial Analysis lists MiMo-V2.5 at $0.14 per 1M input tokens […]</p>

View all