We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

How DeepInfra Built on NVIDIA's Inference Stack and Why It Paid Off
Published on 2026.06.30 by Aray Sultanbekova
How DeepInfra Built on NVIDIA's Inference Stack and Why It Paid Off

How DeepInfra Built on NVIDIA's Inference Stack and Why It Paid Off

When we built DeepInfra, we made a deliberate bet on the NVIDIA inference software stack. Not as a hedge — as a conviction. Today, that bet is paying off in ways that are easy to measure.

The stack

DeepInfra runs on Blackwell-generation GPUs including B300s and our inference stack is built on TensorRT-LLM and NVIDIA Dynamo for distributed serving. We use ModelOpt to quantize models to NVFP4 weights, which reduces memory and compute requirements without meaningful accuracy loss.

These aren't just checkboxes. The components of the NVIDIA inference software stack work together — NVFP4 reduces memory pressure. Dynamo handles KV-aware routing and disaggregated prefill/decode. TensorRT-LLM makes sure the kernels are actually optimized for the hardware underneath. When they work together, the result is production economics that go beyond benchmark performance. Read more about it here.

What it looks like in practice: DeepSeek V4

The clearest proof point is what happened when DeepSeek V4 dropped.

We served it in production on day 0. We launched on Hopper first, then measured performance on B300. The result: 4x better performance. A workload that previously required 4×H200 now runs on a single B300 at higher tokens per second.

"That's not a theoretical improvement. That's a direct reduction in infrastructure cost for the same production traffic."

It happened because the software stack — TensorRT-LLM, Dynamo, NVFP4 — was already in place and ready to take advantage of the hardware the moment it was available. A direct line to the NVIDIA team also helped. Getting a new frontier model deployed and serving production traffic at scale from day one requires more than good software, it requires coordination.

What this means for developers building on DeepInfra

As NVIDIA continues to optimize the inference software stack, those improvements flow through to every model on DeepInfra. Developers don't have to do anything. The same API call gets faster and cheaper as the stack compounds underneath.

That's the reason we invested early in this ecosystem and why we keep investing. The flywheel is real.

Get started

Explore the models available on DeepInfra, including DeepSeek V4 Pro on NVIDIA Blackwell, at deepinfra.com.

Related articles
Best API Providers for DeepSeek V4 in 2026Best API Providers for DeepSeek V4 in 2026<p>DeepSeek V4 is available across a range of hosted API providers, each with different pricing, performance, and deployment trade-offs. The model comes in two variants: V4 Pro, a 1.6 trillion total parameter Mixture-of-Experts model with 49 billion active parameters and a 1M token context window, and V4 Flash, a lighter 284B total parameter variant built [&hellip;]</p>
The easiest way to build AI applications with Llama 2 LLMs.The easiest way to build AI applications with Llama 2 LLMs.The long awaited Llama 2 models are finally here! We are excited to show you how to use them with DeepInfra. These collection of models represent the state of the art in open source language models. They are made available by Meta AI and the l...
MiMo-V2.5 Model Documentation and Integration GuideMiMo-V2.5 Model Documentation and Integration Guide<p>MiMo-V2.5 is a native omnimodal model developed by XiaomiMiMo, designed to process and understand text, image, video, and audio through a unified architecture rather than relying on &#8220;bolted-on&#8221; components for each modality. Built on a 310-billion-parameter Sparse Mixture of Experts (MoE) architecture — with only 15 billion parameters activated during inference — MiMo-V2.5 offers a [&hellip;]</p>