We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepInfra Raises $107M Series B to Scale Inference Infrastructure
Published on 2026.05.04 by Yessen Kanapin
DeepInfra Raises $107M Series B to Scale Inference Infrastructure

We've Raised $107M to Build the Inference Cloud the AI Era Actually Needs

Today we're announcing $107 million in Series B funding to scale DeepInfra's inference cloud and expand our global capacity. The round is co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90.

This is a big moment for our team — but more than that, it's a signal about where AI infrastructure is heading. Since our Series A, we've grown the volume of tokens we process by 25x.

Inference is the new bottleneck

When we started DeepInfra nearly four years ago, we had a conviction that wasn't yet obvious: inference, not training, would become the dominant driver of enterprise AI workloads. We're now squarely at that inflection point.

Two shifts are colliding at once. Open-source models are reaching parity with proprietary systems, unlocking a new wave of innovation at a fraction of the cost. And agent-based systems are driving continuous, high-volume token demand — a single agentic task can require 50 to 100+ model calls and run nonstop.

Inference is no longer a thin layer on top of an AI stack. It's the system constraint that will define the majority of workloads. And most cloud platforms simply weren't built for this always-on, distributed reality. That's why we built DeepInfra from the ground up — for better economics, performance, and security on inference workloads specifically.

Inference demands its own stack

Serving inference well isn't just a software problem, and it isn't just a hardware problem. It's a full-stack problem. Sustained, high-throughput, low-latency inference requires specialized hardware, purpose-built networking, and inference-optimized software working in concert. General-purpose cloud infrastructure — designed for a mix of workloads with bursty, unpredictable patterns — leaves performance and cost on the table when applied to always-on token generation.

That's the gap DeepInfra was built to close. We co-design across all three layers so the stack behaves predictably under the kinds of workloads agentic AI actually produces.

How DeepInfra is built differently

Our approach comes from years of building and operating distributed systems at global scale (the team behind DeepInfra also built imo, the messenger app used by 200M+ people worldwide). A few things make our platform distinct:

Purpose-built and vertically integrated. We own and operate our GPU infrastructure across eight U.S. data centers, with more locations rolling out globally. Owning the stack from chips to APIs gives us structurally better efficiency and more predictable latency than hyperscalers relying on spot or rented capacity.

Designed for the agentic era. Continuous, high-volume token generation isn't an edge case for us — it's the baseline workload we optimize for.

Collaboration with NVIDIA. We're an early infrastructure collaborator in NVIDIA's open AI ecosystem, supporting Nemotron models, the NemoClaw agent framework, and NVIDIA Dynamo inference software. Early deployment of Blackwell GPUs and upcoming Vera Rubin with Dynamo is unlocking up to 20x improvements in inference cost efficiency.

Enterprise-ready by default. 150+ open-source models through OpenAI-compatible APIs, zero data retention, SOC 2 and ISO 27001 certified — production-grade from day one.

What's next

This funding will accelerate three things: expanding our global compute capacity, deepening our developer tooling, and supporting the next generation of open-source and agentic models as they ship.

We're grateful to our investors for backing this thesis, and to the developers, scaleups, and enterprises building on DeepInfra. Production-grade inference is becoming the decisive variable in enterprise AI deployment — and we're just getting started.

If you're building agentic or high-throughput AI workloads, come build with us.

— The DeepInfra Team

Related articles
Best Kimi K2.6 API Providers for Developers (2026)Best Kimi K2.6 API Providers for Developers (2026)<p>Kimi K2.6 is available across a range of hosted API providers, and the right choice depends on what your workload optimizes for — latency, throughput, cost, deployment flexibility, or native feature support. This guide covers the top options by use case. For a detailed cost breakdown across workload types, see the Kimi K2.6 pricing guide. [&hellip;]</p>
Qwen3.5 0.8B API Benchmarks: Latency, Throughput & CostQwen3.5 0.8B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 0.8B (Reasoning) Qwen3.5 0.8B is part of Alibaba Cloud&#8217;s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of &#8220;More Intelligence, Less Compute,&#8221; it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta [&hellip;]</p>
GLM-5 API Benchmarks: Latency, Throughput & CostGLM-5 API Benchmarks: Latency, Throughput & Cost<p>GLM-5 is the latest open-weights reasoning model released by Z AI (Zhipu AI) in February 2026, characterized by high &#8220;thinking token&#8221; usage. It is a Mixture of Experts (MoE) model with 744B total parameters and 40B active parameters, scaling up from GLM-4.5&#8217;s 355B parameters. The model was pre-trained on 28.5T tokens and features a 200K+ [&hellip;]</p>