
Kimi K2.5 is positioned as Moonshot AI’s “do-it-all” model for modern product workflows: native multimodality (text + vision/video), Instant vs. Thinking modes, and support for agentic / multi-agent (“swarm”) execution patterns. In real applications, though, model capability is only half the story. The provider’s inference stack determines the things your users actually feel: time-to-first-token (TTFT), tokens/second while streaming, and how costs behave when you scale concurrency or push long contexts.
This article compares the main Kimi K2.5 API providers tracked by Artificial Analysis and explains what those metrics mean in practice, so you can pick the best provider for your workload.
Moonshot describes K2.5 as its most versatile model to date, emphasizing:
Practical tip: default to Instant (non-thinking) mode for everyday UX, then enable Thinking selectively for hard refactors, long reasoning chains, or when the cost of a wrong answer is high. The rest of this article focuses on the reasoning (Thinking) mode.
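A minimal sketch of that routing rule, assuming your application already labels each request with a task type and a stakes flag. The `Request` class, the task labels, and the returned mode strings are all hypothetical names chosen for illustration; they are not Moonshot or provider API values.

```python
from dataclasses import dataclass

# Hypothetical task labels that justify the slower Thinking mode.
HARD_TASKS = {"hard_refactor", "long_reasoning"}


@dataclass
class Request:
    task: str                  # application-defined label (assumption)
    high_stakes: bool = False  # True when a wrong answer is expensive


def choose_mode(req: Request) -> str:
    """Return "thinking" for hard or high-stakes requests, else "instant"."""
    if req.high_stakes or req.task in HARD_TASKS:
        return "thinking"
    return "instant"


print(choose_mode(Request("chat")))                    # instant
print(choose_mode(Request("hard_refactor")))           # thinking
print(choose_mode(Request("chat", high_stakes=True)))  # thinking
```

How the chosen mode is passed to the model depends on the provider, so check its API reference before wiring this in.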
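For Kimi K2.5, especially in reasoning (“thinking”) mode, speed isn’t one number. It’s two separate behaviors that shape how your product feels: time-to-first-token (TTFT), which is how quickly the stream starts, and output speed (tokens/second), which is how quickly the rest of the answer arrives.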
In Artificial Analysis’ test using 1,000 input tokens, DeepInfra is the fastest provider at 0.31s, narrowly ahead of Together.ai (0.32s). Fireworks follows at 0.46s, then Parasail at 0.58s, and GMI at 0.84s. First-party Kimi (Moonshot direct) starts significantly later at 1.46s, and Novita is slowest at 1.79s. The takeaway is simple: DeepInfra delivers the snappiest “first response” for Kimi K2.5, which is especially valuable for streaming UIs, IDE copilots, and agent loops where TTFT is paid repeatedly across many steps.
Why that matters for reasoning mode: K2.5 reasoning runs often produce long, multi-part outputs. Users don’t mind waiting for the full completion if they see immediate progress. Sub-0.5s TTFT is a UX unlock for:
DeepInfra’s 0.31s TTFT is exactly what you want when you’re running K2.5 as an interactive agent: you get a near-immediate stream start, even when the prompt is already sizeable.
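To make the “paid repeatedly” point concrete, here is a rough sketch that multiplies each provider’s TTFT from the snapshot above by the number of sequential model calls in one agent run. The step count of 20 is an assumption, and the model is deliberately simplified: it treats TTFT as constant per step, whereas in practice it tends to grow as the prompt gets longer.

```python
# TTFT values in seconds, taken from the Artificial Analysis snapshot quoted above.
TTFT_SECONDS = {
    "DeepInfra": 0.31,
    "Together.ai": 0.32,
    "Fireworks": 0.46,
    "Parasail": 0.58,
    "GMI": 0.84,
    "Kimi (Moonshot)": 1.46,
    "Novita": 1.79,
}

STEPS = 20  # assumed number of sequential model calls in one agent run


def cumulative_ttft(ttft_s: float, steps: int) -> float:
    """Total time spent waiting for a first token across sequential steps."""
    return ttft_s * steps


# Print providers from fastest to slowest start.
for provider, ttft in sorted(TTFT_SECONDS.items(), key=lambda kv: kv[1]):
    print(f"{provider:<16} {cumulative_ttft(ttft, STEPS):5.1f} s")
```

Under these assumptions, a 20-step run spends about 6.2 s waiting on first tokens with DeepInfra and about 35.8 s with the slowest provider in the snapshot.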
On the output-speed chart, DeepInfra sits in the upper middle tier on raw throughput (about 81 tokens/sec in the snapshot): fast enough to handle long reasoning outputs without the “waiting forever” feel, while still pairing that with the best-in-class TTFT.
To translate those numbers into something tangible, assume your app streams a 1,000 output-token reasoning answer:
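The estimate can be sketched as TTFT plus output tokens divided by output speed. The example below uses only the DeepInfra figures quoted in this article (0.31s TTFT, roughly 81 tokens/sec); for other providers, plug in the throughput values from the chart, which are not reproduced here. The function name is illustrative.

```python
def completion_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimate end-to-end streaming time: wait for the first token, then stream the rest."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


# DeepInfra figures quoted in this article: 0.31 s TTFT, ~81 tokens/sec.
deepinfra = completion_seconds(ttft_s=0.31, tokens_per_s=81.0, output_tokens=1000)
print(f"DeepInfra, 1,000 output tokens: ~{deepinfra:.1f} s")  # ~12.7 s
```

This is a simplification: it assumes a steady token rate for the whole answer and ignores network variance.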
So if your only KPI is “fastest full completion,” Fireworks wins on this particular snapshot. But reasoning UX is not just completion time; it is also how quickly the model starts responding and how well it supports interactive iteration. This is where DeepInfra’s TTFT lead matters disproportionately:
Net: DeepInfra gives K2.5 reasoning a “fast-start” feel at scale—and that’s often the difference between an agent that feels interactive vs. one that feels sluggish, even if peak tokens/sec isn’t the highest on the chart.
Kimi K2.5 providers generally price per 1M tokens, split into input (everything you send: system prompt, tool schemas, retrieved docs, code) and output (everything the model generates). In real K2.5 workloads such as coding assistants, agent reports, and long “reasoning” write-ups, output tokens often dominate spend, but input cost still compounds quickly once you start using long-context prompts or multi-step agent loops.
In the Artificial Analysis pricing snapshot, DeepInfra offers one of the most balanced price points: $0.50 per 1M input tokens and $2.80 per 1M output tokens.
DeepInfra matches the lowest input tier shown ($0.50/M), while keeping output pricing below the $3.00/M cluster at $2.80/M. That combination is particularly strong for the “real” K2.5 use cases—large prompts plus substantial reasoning output—because it reduces both the prompt tax (when context gets big) and the completion tax (when answers get long). Put simply: DeepInfra avoids the expensive $3.00/M output tier, without pushing you into a higher input tier.
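As a worked example of the per-1M-token arithmetic, the sketch below prices a single request at the DeepInfra rates quoted above and compares it with a $3.00/M output tier at the same input price. The request size (20,000 input tokens, 2,000 output tokens) is an assumed workload, and the function name is illustrative.

```python
INPUT_USD_PER_M = 0.50   # DeepInfra input price quoted above
OUTPUT_USD_PER_M = 2.80  # DeepInfra output price quoted above


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_USD_PER_M,
    output_price: float = OUTPUT_USD_PER_M,
) -> float:
    """Cost of one request when prices are quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed workload: a 20,000-token prompt producing a 2,000-token answer.
at_280 = request_cost_usd(20_000, 2_000)
at_300 = request_cost_usd(20_000, 2_000, output_price=3.00)
print(f"$2.80/M output: ${at_280:.4f} per request")  # $0.0156
print(f"$3.00/M output: ${at_300:.4f} per request")  # $0.0160
```

The per-request gap is small, but it scales linearly with request volume and with output length.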
Putting the benchmark and pricing signals together, DeepInfra is a highly pragmatic default for production Kimi K2.5—especially for interactive, tool-driven “reasoning” experiences where perceived latency and cost stability matter more than peak throughput on a single long completion.
For teams building streaming UIs, IDE copilots, and multi-step agent loops, DeepInfra typically delivers the best “production feel”: fastest time-to-first-token, strong throughput, and pricing that stays competitive even as prompts and outputs scale.
Based on the provider comparison metrics available from Artificial Analysis, DeepInfra is a strong default choice for production Kimi K2.5 deployments, especially for interactive, reasoning-first applications. It delivers the fastest time-to-first-token (0.31s) in the snapshot, which is the metric users feel most in streaming UIs and multi-step agent loops, and it pairs that responsiveness with competitive throughput (~81 tokens/sec). On pricing, DeepInfra sits in a compelling middle ground at $0.50/M input and $2.80/M output, avoiding the $3.00/M output tier seen with several alternatives while staying in the lowest input band shown.
If your primary KPI is maximum tokens/sec on very long single completions, providers like Fireworks can look attractive on throughput alone—but many real-world systems pay TTFT repeatedly across steps, tools, and retries, where DeepInfra’s “fast-start” behavior compounds into a better overall experience. And if your architecture can consistently achieve high cache-hit rates and you’re optimizing for cached input economics specifically, Moonshot’s direct Kimi API remains worth benchmarking in your own workload profile.
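If you do benchmark in your own workload, the two metrics discussed in this article can be measured from any streaming response. The sketch below is provider-agnostic: it accepts any iterable of text chunks and uses chunk count as a rough proxy for tokens, which is an assumption. `measure_stream` and `fake_stream` are hypothetical helpers, and `fake_stream` only simulates a response with sleeps.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Measure time to first chunk and the chunk rate after the first chunk.

    Pass a lazy iterator so the request starts inside this function;
    otherwise the time to first chunk is under-measured.
    """
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    elapsed = last - first
    rate = (count - 1) / elapsed if elapsed > 0 else float("nan")
    return {"ttft_s": first - start, "chunks": count, "chunks_per_s": rate}


def fake_stream(n: int = 5, first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Simulated stream (stand-in for a real API response)."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


print(measure_stream(fake_stream()))
```

To use it against a real provider, replace `fake_stream()` with a generator that sends the request and yields each streamed chunk, then repeat the measurement several times at your real prompt sizes and concurrency.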
© 2026 DeepInfra. All rights reserved.