DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed APIs to dedicated GPU deployments and no-code routing layers. For a detailed cost breakdown, see the Nemotron 3 Super pricing guide.
| Best For | Provider |
|---|---|
| Best overall value & cost | DeepInfra |
| Best for interactive applications | CoreWeave |
| Best for latency-critical & voice agents | Baseten |
| Best for high-volume batch processing | Lightning AI |
| Best for complex agentic workflows | Nebius |
| Best for AWS enterprise integration | Amazon Bedrock |
| Best for flexible deployment options | Qubrid AI |
| Best for asynchronous workloads | Doubleword |
| Best for high availability with routing fallback | OpenRouter |
DeepInfra
DeepInfra is the recommended option for most production Nemotron 3 Super deployments. It delivers the lowest blended price in the benchmarked set at $0.20 per 1M tokens, with strong output speed (459.3 t/s), competitive TTFT (1.01s), and full support for function calling. The platform runs on bare-metal infrastructure, is typically 50–80% cheaper than major cloud alternatives, and is SOC 2 and ISO 27001 certified. Public and private endpoint deployment are both available.
Key features:
For a full breakdown of workload cost scenarios on DeepInfra, see the Nemotron 3 Super pricing guide.
CoreWeave
CoreWeave is highlighted in Artificial Analysis benchmarks for offering competitive sub-second TTFT and low blended pricing. It is a strong fit for real-time inference and cost-sensitive workloads where rapid first response matters.
Key features:
Baseten
Baseten is purpose-built for latency-critical applications. Its 0.56s TTFT is the fastest measured across benchmarked providers — a meaningful advantage for voice-to-voice agents or any interface where perceived responsiveness depends on getting a first response quickly.
Key features:
Lightning AI
Lightning AI leads the benchmarked set on raw output speed at 509.3 t/s — the right choice when sustained generation throughput is the primary constraint, such as high-volume batch processing or document generation pipelines.
Key features:
Nebius
Nebius provides full support for both JSON mode and function calling at high output speeds, making it a solid fit for developers building structured, multi-step agentic workflows that require reliable tool orchestration.
Key features:
Amazon Bedrock
Amazon Bedrock added Nemotron 3 Super on March 18, 2026, providing fully managed access through a single AWS API — no infrastructure to provision. It is the natural choice for enterprise teams already operating within the AWS ecosystem who need compliance, cross-region routing, and flexible service tiers.
Key features:
Qubrid AI
Qubrid AI offers a range of deployment options from simple serverless API access to dedicated GPU VMs and Kubernetes deployments, bridging the gap between managed inference and custom infrastructure.
Key features:
Doubleword
Doubleword focuses on workload flexibility with distinct pricing tiers and a batch processing API for asynchronous inference — useful for teams that want to optimize cost by decoupling generation from real-time latency requirements.
Key features:
OpenRouter
OpenRouter is a unified API routing layer that provides access to Nemotron 3 Super through automatic provider routing and fallback mechanisms. It also offers a free variant (nvidia/nemotron-3-super-120b-a12b:free) with a 1M context window, useful for non-production testing. Current pricing on the paid tier: $0.10/1M input, $0.50/1M output.
Key features:
Provider choice for Nemotron 3 Super depends on what your workload actually optimizes for:
For most production-scale deployments, DeepInfra is the strongest starting point: lowest blended price, full API feature support, and the infrastructure reliability that comes with bare-metal deployment. The API benchmarks for Nemotron 3 Super and the Nemotron 3 Nano explainer are useful companion reads when evaluating the full Nemotron family.
Art That Talks Back: A Hands-On Tutorial on Talking ImagesTurn any image into a talking masterpiece with this step-by-step guide using DeepInfra’s GenAI models.
Kimi K2.6 API Benchmarks: Latency, TPS & Cost Analysis (2026)<p>About Kimi K2.6 Kimi K2.6 is an open-source frontier model from Moonshot AI, released on April 20, 2026. It is a native multimodal agentic model built for long-horizon coding, autonomous execution, and swarm-based task orchestration. The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters per token, using […]</p>
Kimi K2.6 Model Overview: Architecture, Features & Capabilities<p>Kimi K2.6 is Moonshot AI’s latest flagship open-source model, released on April 20, 2026 under a Modified MIT license. It is a native multimodal agentic model built on a 1-trillion parameter Mixture-of-Experts (MoE) architecture, with 32 billion parameters activated per token. The model is designed for long-horizon coding, autonomous execution, and multi-agent orchestration, and is […]</p>
© 2026 DeepInfra. All rights reserved.