
MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through complex problems.
M2.5 was trained extensively with reinforcement learning across more than 200,000 real-world environments, covering over 10 programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby. A notable characteristic is its spec-writing tendency — before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect.
The model achieves industry-leading benchmark scores: 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. It completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1 while consuming fewer tokens. Beyond coding, M2.5 excels at office productivity tasks including generating formatted Word documents, PowerPoint presentations, and Excel spreadsheets with working formulas. The model weights are fully open-sourced on HuggingFace under a Modified MIT License.
MiniMax-M2.5 is now available across multiple inference providers, but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why It’s Best | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | E2E (s/500 tok) |
|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best value + low latency balance; cheapest output tokens; strong for cost-sensitive apps needing snappy first-token response | $0.44 | $0.27 | $0.95 | 0.56s | 66 | 38.64s |
| SiliconFlow (FP8) | Lowest blended price overall (budget-first) | $0.40 | $0.20 | $1.00 | 1.90s | 85 | 31.47s |
| Together.ai (FP4) | Lowest latency (interactive-first) | $0.53 | $0.50 | $0.55 | 0.42s | 95 | 26.80s |
| Fireworks | Very high throughput (speed-first) | $0.53 | $0.50 | $0.55 | 0.76s | 193 | 13.71s |
| Clarifai | Strong low-latency + good speed combination | $0.53 | $0.50 | $0.55 | 0.54s | 143 | 18.07s |
| SambaNova | Fastest output speed overall (throughput-max) | $0.53 | $0.50 | $0.55 | 1.60s | 395 | 7.93s |
Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale MiniMax-M2.5 deployment. It offers the best balance of low latency, competitive pricing, and full feature support. For use cases requiring maximum throughput, SambaNova leads the field. For absolute lowest latency, Together.ai is the fastest provider tested.
DeepInfra emerges as the superior choice for production workloads. While other providers may win on a single metric, it consistently scores in the top tier across every critical category without significant trade-offs.
DeepInfra uses FP8 quantization to deliver a TTFT of 0.56 seconds, within 140ms of the fastest provider (Together.ai at 0.42s) and effectively indistinguishable to human perception. Crucially, it achieves this while costing significantly less than the majority of the market ($0.44 vs $0.53 per 1M blended tokens).
Unlike the top throughput provider (SambaNova), which suffers from high latency (1.60s), DeepInfra maintains a snappy interactive feel. For developers building RAG applications or agents requiring tool use, DeepInfra’s combination of low latency, sub-$0.50 pricing, and full tool-calling support makes it the definitive all-rounder.
```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint, so the standard client works as-is.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
)
print(response.choices[0].message.content)
```

If your use case involves batch processing or generating long-form content, where the start time matters less than the completion time, SambaNova is the undisputed leader in throughput.
SambaNova’s architecture delivers 394.6 t/s, roughly 2x faster than the second-fastest provider (Fireworks) and 8x faster than the slowest (MiniMax Direct at 49 t/s). Its specialized Reconfigurable Dataflow Unit (RDU) hardware, rather than standard GPUs, allows for massive throughput at the cost of higher initial latency.
This speed comes with trade-offs. SambaNova has the smallest context window in the benchmark set (164k vs the standard ~200k) and a relatively high TTFT of 1.60s. This makes it ideal for background generation tasks but less suitable for real-time conversational interfaces.
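To take advantage of that throughput in practice, batch jobs can simply overlap many requests so the 1.60s TTFT is paid concurrently rather than serially. The sketch below assumes SambaNova exposes an OpenAI-compatible endpoint at the base URL shown and that the model id matches the one used elsewhere in this post; verify both against the provider’s docs.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; confirm the URL and model id in SambaNova's docs.
client = OpenAI(
    api_key="YOUR_SAMBANOVA_KEY",
    base_url="https://api.sambanova.ai/v1",
)

prompts = [f"Summarize changelog entry #{i} in two sentences." for i in range(20)]

def generate(prompt: str) -> str:
    # Each request pays the ~1.6s TTFT once, then generates at ~395 t/s.
    response = client.chat.completions.create(
        model="minimax/minimax-m2.5",  # assumed id; providers name models differently
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Overlapping requests amortizes the high per-request latency across the batch.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))
```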
For real-time chat applications where perceived speed is defined by how quickly the first word appears, Together.ai takes the lead.
Utilizing aggressive FP4 quantization, Together.ai achieves the lowest latency in the field at 0.42s. However, this comes at a premium price point ($0.53) compared to budget options. Its output speed of 95 t/s is also significantly slower than the top throughput providers — it is the fastest to start, but not the fastest to finish large generations.
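Because perceived speed in chat is all about the first token, streaming is the natural consumption mode here. The following minimal sketch prints tokens as they arrive and times the first one, which is the TTFT figure the tables report; the base URL reflects Together.ai’s OpenAI-compatible API, and the model id is an assumption to adapt to the provider’s catalog.

```python
import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TOGETHER_KEY",
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

start = time.perf_counter()
first_token_at = None

# stream=True returns chunks as they are generated instead of one final message.
stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",  # assumed id; check the provider's model catalog
    messages=[{"role": "user", "content": "Give me three taglines for a coffee shop."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some providers send usage-only chunks with no choices
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(delta, end="", flush=True)

print(f"\n\nTTFT: {first_token_at:.2f}s")
```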
For developers operating on tight margins or processing massive volumes of non-time-sensitive data, SiliconFlow offers the absolute lowest floor price.
SiliconFlow is the most affordable provider analyzed, undercutting the standard market rate by roughly 24%. However, this cost saving comes with a latency trade-off: with a TTFT of 1.90s, it has one of the slowest response times in the benchmark, more than 3x slower than DeepInfra. It is an excellent choice for offline batch jobs but is not recommended for user-facing applications.
Fireworks serves as a strong alternative for those who need high speed but cannot tolerate the high latency of SambaNova.
Fireworks holds the #2 spot for output speed (193.1 t/s) while maintaining a respectable sub-second latency (0.76s). It bridges the gap between the speed leaders and the latency leaders. However, at $0.53 per 1M tokens, it is notably more expensive than DeepInfra without offering the same low-latency benefits.
| Provider | Optimization | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Context | JSON Mode / Tool Calling |
|---|---|---|---|---|---|---|---|
| DeepInfra | Balanced (Recommended) | $0.27 | $0.95 | 0.56s | 66 | 197k | Yes / Yes |
| SambaNova | Throughput | $0.50 | $0.55 | 1.60s | 394.6 | 164k | Yes / Yes |
| Together.ai | Latency | $0.50 | $0.55 | 0.42s | 95 | 197k | Yes / Yes |
| SiliconFlow | Cost | $0.20 | $1.00 | 1.90s | 85 | 197k | Yes / Yes |
| Fireworks | Speed / Hybrid | $0.50 | $0.55 | 0.76s | 193.1 | 197k | Yes / Yes |
| Clarifai | Hybrid | $0.50 | $0.55 | 0.54s | 142.6 | 205k | Yes / Yes |
| MiniMax Direct | Native | $0.50 | $0.55 | 3.23s | 49 | 205k | No / Yes |
DeepInfra is the best choice for RAG due to its low TTFT (0.56s) and full support for JSON Mode and Function Calling, which are essential for retrieving and formatting context.
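As an illustration, JSON Mode is requested through the standard response_format parameter on OpenAI-compatible endpoints. This is a minimal sketch against DeepInfra; the answer/sources schema in the system prompt is a hypothetical example, not something the API defines.

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# JSON Mode constrains the output to valid JSON, so the result is safe to parse.
response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[
        {"role": "system", "content": 'Reply with a JSON object: {"answer": str, "sources": [str]}.'},
        {"role": "user", "content": "Which provider has the lowest TTFT?"},
    ],
    response_format={"type": "json_object"},
)

payload = json.loads(response.choices[0].message.content)
print(payload["answer"])
```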
MiniMax-M2.5 supports function calling, but not all providers enable it. DeepInfra, Together.ai, and Fireworks support full tool use, while the MiniMax Direct API currently has limited support.
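On providers with full tool use, tool definitions follow the standard OpenAI function-calling format. Here is a minimal sketch against DeepInfra; the get_ticket_status tool is hypothetical.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# A hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of a support ticket by id.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "What's the status of ticket TK-1042?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```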
SambaNova uses a specialized Reconfigurable Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput (394 t/s) at the cost of higher initial latency.
MiniMax-M2.5 achieves state-of-the-art performance in programming evaluations, scoring 80.2% on SWE-Bench Verified. The model was trained on over 10 languages across more than 200,000 real-world environments and excels at the entire development lifecycle, from system design to code review.
DeepInfra offers better value at $0.44/1M tokens vs Together.ai’s $0.53/1M tokens. Together.ai has slightly lower latency (0.42s vs 0.56s), but for most applications, this 140ms difference is imperceptible to users. DeepInfra is the recommended choice unless sub-half-second latency is absolutely critical.
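For readers checking the math: the blended figures in this post are reproduced exactly by a 3:1 input:output token mix, a common blending assumption, e.g. (3 × $0.27 + $0.95) / 4 = $0.44 for DeepInfra. The helper below is a minimal sketch of that arithmetic plus cost-per-completion.

```python
def blended_price(input_per_m: float, output_per_m: float, input_ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming input_ratio input tokens per output token."""
    return (input_ratio * input_per_m + output_per_m) / (input_ratio + 1)

def completion_cost(input_tokens: int, output_tokens: int,
                    input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of a single request from raw token counts."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

print(blended_price(0.27, 0.95))  # DeepInfra  -> 0.44
print(blended_price(0.20, 1.00))  # SiliconFlow -> 0.40
# A 3,000-token prompt with a 500-token answer on DeepInfra:
print(completion_cost(3_000, 500, 0.27, 0.95))  # ~ $0.0013
```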
For the vast majority of MiniMax-M2.5 implementations, DeepInfra is the logical choice. It provides the kind of low-latency experience (0.56s) usually reserved for more expensive providers while maintaining one of the lowest price points in the field ($0.44). While SambaNova is technically superior for pure bulk text generation, DeepInfra’s versatility across RAG, agents, and chat interfaces makes it the standout provider in this benchmark.