DeepInfra raises $107M Series B to scale the inference cloud

Today we're announcing $107 million in Series B funding to scale DeepInfra's inference cloud and expand our global capacity. The round is co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90.
This is a big moment for our team — but more than that, it's a signal about where AI infrastructure is heading. Since our Series A, we've grown the volume of tokens we process by 25x.
When we started DeepInfra nearly four years ago, we had a conviction that wasn't yet obvious: inference, not training, would become the dominant driver of enterprise AI workloads. We're now squarely at that inflection point.
Two shifts are colliding at once. Open-source models are reaching parity with proprietary systems, unlocking a new wave of innovation at a fraction of the cost. And agent-based systems are driving continuous, high-volume token demand — a single agentic task can require 50 to 100+ model calls and run nonstop.
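To make that call volume concrete, here is a minimal sketch of the loop at the heart of most agentic systems. The names `call_model` and `run_tool` are hypothetical stubs standing in for a real LLM endpoint and real tools, not DeepInfra APIs; the point is structural, since every iteration is a full inference request, and a task that takes dozens of steps multiplies token demand accordingly.

```python
# Minimal agent-loop sketch. call_model and run_tool are hypothetical stubs,
# not DeepInfra APIs; a real agent would hit an inference endpoint each turn.

def call_model(history):
    # One full inference request: the whole history goes in, tokens come out.
    # Stub behavior: ask for a tool twice, then produce a final answer.
    tool_steps = sum(1 for m in history if m["role"] == "tool")
    if tool_steps < 2:
        return {"tool": "search", "args": {"q": history[0]["content"]}}
    return {"content": "final answer, after several tool-augmented steps"}

def run_tool(name, args):
    return f"result of {name}({args})"  # stub tool execution

def run_agent(task, max_steps=100):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):          # every iteration is another model call
        reply = call_model(history)
        if "tool" in reply:             # model requested a tool step
            history.append({"role": "tool",
                            "content": run_tool(reply["tool"], reply["args"])})
            continue                    # loop again: one more inference request
        return reply["content"]
    raise RuntimeError("agent hit step limit")

print(run_agent("look up Series B coverage"))  # 3 model calls for one small task
```

Even this toy task costs three inference calls; real agents with planning, retries, and multiple tools routinely run to dozens or hundreds, which is exactly the always-on load pattern described above.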
Inference is no longer a thin layer on top of an AI stack. It's the system constraint that will define the majority of workloads. And most cloud platforms simply weren't built for this always-on, distributed reality. That's why we built DeepInfra from the ground up — for better economics, performance, and security on inference workloads specifically.
Serving inference well isn't just a software problem, and it isn't just a hardware problem. It's a full-stack problem. Sustained, high-throughput, low-latency inference requires specialized hardware, purpose-built networking, and inference-optimized software working in concert. General-purpose cloud infrastructure — designed for a mix of workloads with bursty, unpredictable patterns — leaves performance and cost on the table when applied to always-on token generation.
That's the gap DeepInfra was built to close. We co-design across all three layers so the stack behaves predictably under the kinds of workloads agentic AI actually produces.
Our approach comes from years of building and operating distributed systems at global scale (the team behind DeepInfra also built imo, the messenger app used by 200M+ people worldwide). A few things make our platform distinct:
Purpose-built and vertically integrated. We own and operate our GPU infrastructure across eight U.S. data centers, with more locations rolling out globally. Owning the stack from chips to APIs gives us structurally better efficiency and more predictable latency than hyperscalers relying on spot or rented capacity.
Designed for the agentic era. Continuous, high-volume token generation isn't an edge case for us — it's the baseline workload we optimize for.
Collaboration with NVIDIA. We're an early infrastructure collaborator in NVIDIA's open AI ecosystem, supporting Nemotron models, the NemoClaw agent framework, and NVIDIA Dynamo inference software. Early deployment of Blackwell GPUs and upcoming Vera Rubin with Dynamo is unlocking up to 20x improvements in inference cost efficiency.
Enterprise-ready by default. 150+ open-source models served through OpenAI-compatible APIs (a minimal usage sketch follows this list), zero data retention, and SOC 2 and ISO 27001 certification. Production-grade from day one.
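For developers, OpenAI compatibility means existing client code can point at DeepInfra with a base-URL change. Below is a minimal sketch using the official openai Python SDK; the base URL and model name are assumptions drawn from DeepInfra's public documentation, so check the current docs before relying on them.

```python
# Minimal sketch of the OpenAI-compatible path (requires: pip install openai).
# Base URL and model name are assumptions; verify against DeepInfra's docs.
from openai import OpenAI

client = OpenAI(
    api_key="<DEEPINFRA_API_TOKEN>",                 # your DeepInfra token
    base_url="https://api.deepinfra.com/v1/openai",  # assumed compatible endpoint
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",   # one of the hosted models
    messages=[{"role": "user",
               "content": "Summarize why inference is the new bottleneck."}],
)
print(resp.choices[0].message.content)
```

Because the request and response shapes match the OpenAI SDK, switching an existing application over is typically a configuration change rather than a rewrite.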
This funding will accelerate three things: expanding our global compute capacity, deepening our developer tooling, and supporting the next generation of open-source and agentic models as they ship.
We're grateful to our investors for backing this thesis, and to the developers, scaleups, and enterprises building on DeepInfra. Production-grade inference is becoming the decisive variable in enterprise AI deployment — and we're just getting started.
If you're building agentic or high-throughput AI workloads, come build with us.
— The DeepInfra Team