Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters.
Kimi K2.5 operates in both “Thinking” and “Instant” modes, allowing developers to toggle between deep chain-of-thought reasoning and faster, direct responses. The model supports a 256K token context window and excels in visual knowledge, cross-modal reasoning, and agentic tool use. One of its standout capabilities is “Agent Swarm” technology, which enables the model to decompose complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents.
On benchmarks, Kimi K2.5 has set state-of-the-art records on Humanity’s Last Exam (HLE), BrowseComp, and other agentic benchmarks, achieving 50.2% on HLE with tools, 96.1% on AIME 2025, and 76.8% on SWE-Bench Verified.
Kimi K2.5 is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Best For | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | Latency (TTFT) | Why Notable |
|---|---|---|---|---|---|---|---|
| DeepInfra | Lowest cost / scale-out workloads | $0.90 | $0.45 | $2.25 | 66 | 1.06s | Best unit economics — lowest blended, input, and output pricing. Ideal for batch, large-context, and cost-sensitive production. |
| DeepInfra Turbo | Cost-aware speed upgrade | $1.20 | — | — | 334 | 0.69s | Pay a bit more, get far more speed — while staying in the mainstream price band. |
| Nebius Fast | Low cost + high speed | $1.00 | $0.50 | $2.50 | 338 | 1.86s | Fast throughput near top tier while staying close to the low-price floor. |
| Together.ai | Maximum throughput | $1.07 | $0.50 | — | 431.1 | 1.37s | Fastest output speed measured; good for throughput-first systems at a still-competitive price. |
| Baseten | Lowest latency | $1.20 | — | — | 334 | 0.40s | Best TTFT for interactive UX, though at higher blended price than DeepInfra. |
Based on benchmarks across 17 tracked providers, DeepInfra is the recommended API for production-scale Kimi K2.5 deployment. It offers the market’s lowest price ($0.90/1M) for background tasks and a high-performance Turbo tier ($1.20/1M) that rivals the fastest competitors in throughput and latency. For maximum throughput, Together.ai leads at 431.1 t/s. For the lowest latency, Baseten delivers a best-in-class 0.40s TTFT.
Best for: Cost efficiency and flexible performance tiers.
DeepInfra secures the top spot by offering a bifurcated service model that caters to both cost-sensitive batch processing and high-performance interactive applications. It is currently the most affordable provider on the market.
At $0.90 per 1M tokens, DeepInfra is the cheapest option available, undercutting the closest competitors (Nebius Fast and Parasail) by 10%. The Turbo tier jumps to 334 tokens/sec with a latency of 0.69s, giving developers the flexibility to use the Standard tier for background reasoning tasks and the Turbo tier for user-facing applications — all within the same ecosystem.
Important: While DeepInfra Standard supports Function Calling, DeepInfra Turbo does not currently list this feature. Developers requiring tool use should select the Standard endpoint or verify recent updates.
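The tier guidance above can be expressed as a small routing helper. This is only a sketch: the function name and the "standard" / "turbo" labels are hypothetical and are not DeepInfra model IDs, and the rule assumes Turbo still does not list function calling.

```python
def pick_deepinfra_tier(needs_function_calling: bool, interactive: bool) -> str:
    """Return a tier label for a request.

    The labels "standard" and "turbo" are hypothetical names for this sketch,
    not API model identifiers.
    """
    if needs_function_calling:
        # Turbo does not currently list function calling, so tool use goes to Standard.
        return "standard"
    # Without tools: user-facing traffic goes to Turbo, background work stays on Standard.
    return "turbo" if interactive else "standard"


print(pick_deepinfra_tier(needs_function_calling=False, interactive=True))
print(pick_deepinfra_tier(needs_function_calling=True, interactive=True))
```

If Turbo later adds function calling, the first branch would no longer be needed.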
Best for: High-volume text generation and long-context reasoning.
If raw generation speed is the primary KPI, Together.ai is the market leader. Kimi K2.5 is a reasoning model, meaning it generates “thinking” tokens before the final answer — high output speed is critical to reducing total wait time.
Together.ai clocks in at 431.1 t/s — approximately 14.3x faster than the slowest provider (SiliconFlow). It outperforms the second-fastest provider, Eigen AI, by a margin of ~7 t/s. Despite this premium speed, its pricing ($1.07) remains highly competitive, sitting only slightly above the $1.00 budget tier.
Best for: Real-time chatbots and interactive agents.
For applications where the perceived speed (Time to First Token) is more important than total generation time, Baseten offers the most responsive infrastructure.
Baseten achieves a remarkable 0.40s TTFT — significantly faster than the average provider, beating the runner-up FriendliAI (0.52s) by 120ms. It maintains a high output speed of 334 t/s (identical to DeepInfra Turbo), ensuring that once the first token appears, the rest of the response follows rapidly.
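The trade-off between TTFT and output speed can be made concrete with a simple model: total wait is roughly TTFT plus output tokens divided by throughput. The sketch below uses the Baseten and Together.ai figures quoted in this article. It assumes throughput stays constant for the whole response and that reasoning tokens are counted as output tokens; the function names are illustrative.

```python
def total_response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimate seconds until the last token arrives: TTFT plus generation time."""
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, speed_a: float, ttft_b: float, speed_b: float) -> float:
    """Output length at which providers A and B finish at the same moment.

    Assumes A has the lower TTFT and B has the higher throughput.
    """
    return (ttft_b - ttft_a) / (1 / speed_a - 1 / speed_b)


# (TTFT in seconds, tokens per second), taken from the figures in this article.
BASETEN = (0.40, 334.0)
TOGETHER = (1.37, 431.1)

crossover = break_even_tokens(*BASETEN, *TOGETHER)
print(f"Break-even at about {crossover:.0f} output tokens")

for n in (200, 2000, 8000):
    baseten_s = total_response_time(*BASETEN, n)
    together_s = total_response_time(*TOGETHER, n)
    print(f"{n} tokens: Baseten {baseten_s:.2f}s, Together.ai {together_s:.2f}s")
```

Under these assumptions the break-even lands at roughly 1,400 output tokens. Shorter responses finish sooner on the lower-TTFT provider, while long reasoning traces favor the higher-throughput one.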
Best for: A balance of speed and pricing.
Nebius Fast offers a compelling sweet spot between the extreme speed of Together.ai and the extreme economy of DeepInfra.
Nebius Fast roughly matches DeepInfra Turbo’s throughput (338 t/s vs 334 t/s) at a lower price point ($1.00 vs $1.20). However, it suffers on latency, with a TTFT of 1.86s, about 4.65x Baseten’s 0.40s. It is an excellent choice for non-interactive workloads where throughput per dollar is the primary metric.
| Provider | Output Speed (t/s) | Latency (TTFT) |
|---|---|---|
| Together.ai | 431.1 | 1.37s |
| Eigen AI | 423.7 | 1.14s |
| Clarifai | 370.7 | 0.74s |
| Fireworks | 353.7 | 0.62s |
| DeepInfra Turbo | 334.0 | 0.69s |
| Provider | Blended Price (/1M) | Input Price | Output Price |
|---|---|---|---|
| DeepInfra | $0.90 | $0.45 | $2.25 |
| Nebius Fast | $1.00 | $0.50 | $2.50 |
| Parasail | $1.00 | N/A | N/A |
| Clarifai | $1.07 | N/A | $2.50 |
| Together.ai | $1.07 | $0.50 | N/A |
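The article does not state how the blended price is derived. For the two rows that list both input and output prices (DeepInfra and Nebius Fast), the figures are consistent with a 3:1 input-to-output token weighting, so the sketch below assumes that ratio. Treat the weighting as an assumption, and adjust it to match your own traffic mix.

```python
def blended_price(
    input_price: float,
    output_price: float,
    input_weight: float = 3.0,
    output_weight: float = 1.0,
) -> float:
    """Weighted average price per 1M tokens.

    The default 3:1 input-to-output weighting is an assumption inferred from
    the table rows that list both prices; it is not stated in the article.
    """
    total_weight = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total_weight


print(f"DeepInfra:   ${blended_price(0.45, 2.25):.2f}")
print(f"Nebius Fast: ${blended_price(0.50, 2.50):.2f}")
```

Reasoning-heavy workloads produce far more output than a 3:1 ratio implies, so passing a lower `input_weight` gives a more realistic per-token cost for them.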
Technical integration is just as important as raw speed.
Most providers hosting Kimi K2.5 utilize OpenAI-compatible endpoints. Here is how to configure your client for DeepInfra:
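```python
import os

from openai import OpenAI

# Configuration for DeepInfra (Best Value)
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ.get("DEEPINFRA_API_KEY"),
)

# Request a streamed completion so tokens can be shown as they arrive
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.5-reasoning",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
    stream=True,
)

# Print each content delta; some chunks carry no text
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Note: When using Kimi K2.5, “Reasoning Tokens” are billed as output tokens. Ensure your max_tokens limit accounts for the internal chain-of-thought process.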
Not every provider endpoint supports function calling for Kimi K2.5. The model supports it natively, but DeepInfra Turbo does not currently list the feature, whereas DeepInfra Standard, Together.ai, and Baseten do.
Compared with DeepSeek R1, Kimi K2.5 generally offers higher throughput on equivalent hardware, though DeepSeek R1 remains cheaper on legacy providers. Kimi’s advantage lies in its 256K-token context window (listed as 262K by some providers) and native multimodal capabilities.
On DeepInfra, the Standard tier operates at ~66 t/s and costs $0.90/1M tokens, while the Turbo tier operates at ~334 t/s and costs $1.20/1M tokens. Use Standard for batch jobs and Turbo for live applications.
Reasoning models like Kimi K2.5 generate internal “thinking” tokens before producing the final answer. These reasoning tokens are billed as output tokens. The prices listed in this benchmark include reasoning output tokens.
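Because reasoning tokens are billed as output, the cost of a single request can be estimated as in the sketch below. The default prices are the DeepInfra input and output rates quoted in this article; the token counts in the example are made up for illustration.

```python
def request_cost_usd(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    input_price_per_m: float = 0.45,   # DeepInfra input price, $ per 1M tokens
    output_price_per_m: float = 2.25,  # DeepInfra output price, $ per 1M tokens
) -> float:
    """Estimate the cost of one request in USD."""
    # Reasoning tokens are billed at the output rate, alongside the visible answer.
    billed_output = reasoning_tokens + answer_tokens
    return (input_tokens * input_price_per_m + billed_output * output_price_per_m) / 1_000_000


# Hypothetical request: 2,000 prompt tokens, 3,000 reasoning tokens, 500 answer tokens.
cost = request_cost_usd(2_000, 3_000, 500)
print(f"${cost:.6f}")
```

In this hypothetical request the hidden reasoning accounts for most of the bill, which is why the output price matters more than the input price for reasoning-heavy workloads.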
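Kimi K2.5 supports a 256K token context window. Some providers list it as 262K; the two figures likely describe the same limit, since 256 × 1,024 = 262,144 tokens, but check each provider’s configuration.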
For the majority of developers, DeepInfra is the superior choice for Kimi K2.5. It offers the market’s lowest price ($0.90/1M) for background tasks and a high-performance Turbo tier ($1.20/1M) that rivals the fastest competitors in throughput and latency.
© 2026 DeepInfra. All rights reserved.