We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepInfra Raises $107M Series B to Scale Inference Infrastructure

Published on 2026.05.04 by Yessen Kanapin

We've Raised $107M to Build the Inference Cloud the AI Era Actually Needs

Today we're announcing $107 million in Series B funding to scale DeepInfra's inference cloud and expand our global capacity. The round is co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90.

This is a big moment for our team — but more than that, it's a signal about where AI infrastructure is heading. Since our Series A, we've grown the volume of tokens we process by 25x.

Inference is the new bottleneck

When we started DeepInfra nearly four years ago, we had a conviction that wasn't yet obvious: inference, not training, would become the dominant driver of enterprise AI workloads. We're now squarely at that inflection point.

Two shifts are colliding at once. Open-source models are reaching parity with proprietary systems, unlocking a new wave of innovation at a fraction of the cost. And agent-based systems are driving continuous, high-volume token demand — a single agentic task can require 50 to 100+ model calls and run nonstop.

Inference is no longer a thin layer on top of an AI stack. It's the system constraint that will define the majority of workloads. And most cloud platforms simply weren't built for this always-on, distributed reality. That's why we built DeepInfra from the ground up — for better economics, performance, and security on inference workloads specifically.

Inference demands its own stack

Serving inference well isn't just a software problem, and it isn't just a hardware problem. It's a full-stack problem. Sustained, high-throughput, low-latency inference requires specialized hardware, purpose-built networking, and inference-optimized software working in concert. General-purpose cloud infrastructure — designed for a mix of workloads with bursty, unpredictable patterns — leaves performance and cost on the table when applied to always-on token generation.

That's the gap DeepInfra was built to close. We co-design across all three layers so the stack behaves predictably under the kinds of workloads agentic AI actually produces.

How DeepInfra is built differently

Our approach comes from years of building and operating distributed systems at global scale (the team behind DeepInfra also built imo, the messenger app used by 200M+ people worldwide). A few things make our platform distinct:

Purpose-built and vertically integrated. We own and operate our GPU infrastructure across eight U.S. data centers, with more locations rolling out globally. Owning the stack from chips to APIs gives us structurally better efficiency and more predictable latency than hyperscalers relying on spot or rented capacity.

Designed for the agentic era. Continuous, high-volume token generation isn't an edge case for us — it's the baseline workload we optimize for.

Collaboration with NVIDIA. We're an early infrastructure collaborator in NVIDIA's open AI ecosystem, supporting Nemotron models, the NemoClaw agent framework, and NVIDIA Dynamo inference software. Early deployment of Blackwell GPUs and upcoming Vera Rubin with Dynamo is unlocking up to 20x improvements in inference cost efficiency.

Enterprise-ready by default. 150+ open-source models through OpenAI-compatible APIs, zero data retention, SOC 2 and ISO 27001 certified — production-grade from day one.

What's next

This funding will accelerate three things: expanding our global compute capacity, deepening our developer tooling, and supporting the next generation of open-source and agentic models as they ship.

We're grateful to our investors for backing this thesis, and to the developers, scaleups, and enterprises building on DeepInfra. Production-grade inference is becoming the decisive variable in enterprise AI deployment — and we're just getting started.

If you're building agentic or high-throughput AI workloads, come build with us.

— The DeepInfra Team

Kimi K2.5 API Benchmarks: Latency, Throughput & CostAbout Kimi K2.5 Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 […]

GLM-4.6 API: Get fast first tokens at the best $/M from Deepinfra's API - Deep InfraGLM-4.6 is a high-capacity, “reasoning”-tuned model that shows up in coding copilots, long-context RAG, and multi-tool agent loops. With this class of workload, provider infrastructure determines perceived speed (first-token time), tail stability, and your unit economics. Using ArtificialAnalysis (AA) provider charts for GLM-4.6 (Reasoning), DeepInfra (FP8) pairs a sub-second Time-to-First-Token (TTFT) (0.51 s) with the […]

What Is Google TurboQuant and What Does It Mean for Open Source Inference? - Deep InfraIn late March 2026, Google Research published a paper that got more attention outside of academic circles than most AI research does. TurboQuant, a new compression algorithm for the key-value cache in large language models, landed with enough noise that Cloudflare CEO Matthew Prince called it Google’s DeepSeek moment. The Silicon Valley Pied Piper comparisons […]

View all