We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

How DeepInfra Built on NVIDIA's Inference Stack and Why It Paid Off

Published on 2026.06.30 by Aray Sultanbekova

How DeepInfra Built on NVIDIA's Inference Stack and Why It Paid Off

When we built DeepInfra, we made a deliberate bet on the NVIDIA inference software stack. Not as a hedge — as a conviction. Today, that bet is paying off in ways that are easy to measure.

The stack

DeepInfra runs on Blackwell-generation GPUs including B300s and our inference stack is built on TensorRT-LLM and NVIDIA Dynamo for distributed serving. We use ModelOpt to quantize models to NVFP4 weights, which reduces memory and compute requirements without meaningful accuracy loss.

These aren't just checkboxes. The components of the NVIDIA inference software stack work together — NVFP4 reduces memory pressure. Dynamo handles KV-aware routing and disaggregated prefill/decode. TensorRT-LLM makes sure the kernels are actually optimized for the hardware underneath. When they work together, the result is production economics that go beyond benchmark performance. Read more about it here.

What it looks like in practice: DeepSeek V4

The clearest proof point is what happened when DeepSeek V4 dropped.

We served it in production on day 0. We launched on Hopper first, then measured performance on B300. The result: 4x better performance. A workload that previously required 4×H200 now runs on a single B300 at higher tokens per second.

"That's not a theoretical improvement. That's a direct reduction in infrastructure cost for the same production traffic."

It happened because the software stack — TensorRT-LLM, Dynamo, NVFP4 — was already in place and ready to take advantage of the hardware the moment it was available. A direct line to the NVIDIA team also helped. Getting a new frontier model deployed and serving production traffic at scale from day one requires more than good software, it requires coordination.

What this means for developers building on DeepInfra

As NVIDIA continues to optimize the inference software stack, those improvements flow through to every model on DeepInfra. Developers don't have to do anything. The same API call gets faster and cheaper as the stack compounds underneath.

That's the reason we invested early in this ecosystem and why we keep investing. The flywheel is real.

Get started

Explore the models available on DeepInfra, including DeepSeek V4 Pro on NVIDIA Blackwell, at deepinfra.com.

Introducing GLM-5.2 on DeepInfra<p>GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks — degraded retrieval, inconsistent behavior at range. Z-AI describes this as the first time that scale has been delivered with reliability for sustained, long-horizon work. The coding […]</p>

Deploy Custom LLMs on DeepInfraDid you just finetune your favorite model and are wondering where to run it? Well, we have you covered. Simple API and predictable pricing. Put your model on huggingface Use a private repo, if you wish, we don't mind. Create a hf access token just for the repo for better security. Create c...

Qwen3.5 27B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 27B (Reasoning) Qwen3.5 27B is part of Alibaba Cloud’s latest-generation foundation model family, released in February 2026. Unlike the Mixture-of-Experts variants in the Qwen3.5 series, the 27B model uses a dense architecture combining Gated Delta Networks and Feed Forward Networks. It achieves strong benchmark scores including MMLU-Pro (86.1%), GPQA Diamond (85.5%), and SWE-bench […]</p>

View all