We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Best API Providers for NVIDIA Nemotron 3 Super 120B
Published on 2026.05.25 by DeepInfra
Best API Providers for NVIDIA Nemotron 3 Super 120B

Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed APIs to dedicated GPU deployments and no-code routing layers. For a detailed cost breakdown, see the Nemotron 3 Super pricing guide.

Summary of Top Providers by Use Case

Best ForProvider
Best overall value & costDeepInfra
Best for interactive applicationsCoreWeave
Best for latency-critical & voice agentsBaseten
Best for high-volume batch processingLightning AI
Best for complex agentic workflowsNebius
Best for AWS enterprise integrationAmazon Bedrock
Best for flexible deployment optionsQubrid AI
Best for asynchronous workloadsDoubleword
Best for high availability with routing fallbackOpenRouter

Detailed Provider Reviews

DeepInfra

DeepInfra is the recommended option for most production Nemotron 3 Super deployments. It delivers the lowest blended price in the benchmarked set at $0.20 per 1M tokens, with strong output speed (459.3 t/s), competitive TTFT (1.01s), and full support for function calling. The platform runs on bare-metal infrastructure, is typically 50–80% cheaper than major cloud alternatives, and is SOC 2 and ISO 27001 certified. Public and private endpoint deployment are both available.

Key features:

  • Lowest blended price at $0.20/1M tokens; $0.10/1M input, $0.50/1M output
  • 459.3 t/s output speed
  • 1.01s TTFT
  • Function calling and JSON mode supported
  • 262k context window
  • Public and private endpoints; SOC 2 and ISO 27001 certified

For a full breakdown of workload cost scenarios on DeepInfra, see the Nemotron 3 Super pricing guide.

CoreWeave

CoreWeave is highlighted in Artificial Analysis benchmarks for offering competitive sub-second TTFT and low blended pricing. It is a strong fit for real-time inference and cost-sensitive workloads where rapid first response matters.

Key features:

  • $0.26/1M tokens blended price
  • 0.98s TTFT (fastest in the Artificial Analysis benchmark set)
  • 154.4 t/s output speed
  • Function calling and JSON mode supported
  • 262k context window

Baseten

Baseten is purpose-built for latency-critical applications. Its 0.56s TTFT is the fastest measured across benchmarked providers — a meaningful advantage for voice-to-voice agents or any interface where perceived responsiveness depends on getting a first response quickly.

Key features:

  • 0.56s TTFT (fastest across all benchmarked providers)
  • 479.9 t/s output speed
  • $0.41/1M tokens blended price
  • 203k context window

Lightning AI

Lightning AI leads the benchmarked set on raw output speed at 509.3 t/s — the right choice when sustained generation throughput is the primary constraint, such as high-volume batch processing or document generation pipelines.

Key features:

  • 509.3 t/s output speed (fastest in the set)
  • JSON mode supported
  • 256k context window
  • $0.39–0.45/1M tokens blended price depending on benchmark source

Nebius

Nebius provides full support for both JSON mode and function calling at high output speeds, making it a solid fit for developers building structured, multi-step agentic workflows that require reliable tool orchestration.

Key features:

  • JSON mode and function calling both supported
  • Up to 483.7 t/s output speed
  • 256k context window
  • $0.36–0.45/1M tokens blended price

Amazon Bedrock

Amazon Bedrock added Nemotron 3 Super on March 18, 2026, providing fully managed access through a single AWS API — no infrastructure to provision. It is the natural choice for enterprise teams already operating within the AWS ecosystem who need compliance, cross-region routing, and flexible service tiers.

Key features:

  • Access via bedrock-runtime and bedrock-mantle endpoints
  • Client-side and server-side tool calling supported
  • Standard, Priority, Flex, and Reserved service tiers
  • Cross-region routing (Geo and Global Cross-Region)
  • 256k context window, up to 32k output tokens

Qubrid AI

Qubrid AI offers a range of deployment options from simple serverless API access to dedicated GPU VMs and Kubernetes deployments, bridging the gap between managed inference and custom infrastructure.

Key features:

  • Serverless API at $0.10/1M input, $0.50/1M output tokens
  • Dedicated cloud GPU VMs from $1.25/GPU/hr
  • Official Docker images for containerized deployments
  • Production-grade Kubernetes manifests and Helm charts
  • SDKs for Python, JavaScript, Go, and Java

Doubleword

Doubleword focuses on workload flexibility with distinct pricing tiers and a batch processing API for asynchronous inference — useful for teams that want to optimize cost by decoupling generation from real-time latency requirements.

Key features:

  • Standard, Async, and Realtime pricing tiers
  • Batch processing API for asynchronous workloads
  • OpenAI-compatible endpoints
  • 256k context window

OpenRouter

OpenRouter is a unified API routing layer that provides access to Nemotron 3 Super through automatic provider routing and fallback mechanisms. It also offers a free variant (nvidia/nemotron-3-super-120b-a12b:free) with a 1M context window, useful for non-production testing. Current pricing on the paid tier: $0.10/1M input, $0.50/1M output.

Key features:

  • Unified OpenAI-compatible API with automatic provider routing
  • Fallback mechanisms to maximize uptime
  • $0.10/1M input, $0.50/1M output on paid tier
  • Free variant available with 1M token context window
  • 1M token context window (paid tier)

Conclusion

Provider choice for Nemotron 3 Super depends on what your workload actually optimizes for:

  • Production deployments at scale: DeepInfra — lowest blended cost, full function calling, private endpoints
  • Interactive and latency-critical apps: Baseten (0.56s TTFT) or CoreWeave (0.98s TTFT, lowest blended in AA benchmark set)
  • High-volume batch processing: Lightning AI — 509.3 t/s output speed
  • Complex agentic workflows needing JSON + function calling: Nebius
  • AWS enterprise integration: Amazon Bedrock — fully managed, compliant, cross-region
  • Flexible self-hosted or dedicated GPU: Qubrid AI
  • Async batch workloads: Doubleword
  • High availability with routing fallback: OpenRouter

For most production-scale deployments, DeepInfra is the strongest starting point: lowest blended price, full API feature support, and the infrastructure reliability that comes with bare-metal deployment. The API benchmarks for Nemotron 3 Super and the Nemotron 3 Nano explainer are useful companion reads when evaluating the full Nemotron family.

Related articles
Art That Talks Back: A Hands-On Tutorial on Talking ImagesArt That Talks Back: A Hands-On Tutorial on Talking ImagesTurn any image into a talking masterpiece with this step-by-step guide using DeepInfra’s GenAI models.
Kimi K2.6 API Benchmarks: Latency, TPS & Cost Analysis (2026)Kimi K2.6 API Benchmarks: Latency, TPS & Cost Analysis (2026)<p>About Kimi K2.6 Kimi K2.6 is an open-source frontier model from Moonshot AI, released on April 20, 2026. It is a native multimodal agentic model built for long-horizon coding, autonomous execution, and swarm-based task orchestration. The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters per token, using [&hellip;]</p>
Kimi K2.6 Model Overview: Architecture, Features & CapabilitiesKimi K2.6 Model Overview: Architecture, Features & Capabilities<p>Kimi K2.6 is Moonshot AI&#8217;s latest flagship open-source model, released on April 20, 2026 under a Modified MIT license. It is a native multimodal agentic model built on a 1-trillion parameter Mixture-of-Experts (MoE) architecture, with 32 billion parameters activated per token. The model is designed for long-horizon coding, autonomous execution, and multi-agent orchestration, and is [&hellip;]</p>