
MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through complex problems.
M2.5 was trained extensively with reinforcement learning across more than 200,000 real-world environments, covering over 10 programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby. A notable characteristic is its spec-writing tendency — before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect.
The model achieves industry-leading benchmark scores: 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. It completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1 while consuming fewer tokens. Beyond coding, M2.5 excels at office productivity tasks including generating formatted Word documents, PowerPoint presentations, and Excel spreadsheets with working formulas. The model weights are fully open-sourced on HuggingFace under a Modified MIT License.
MiniMax-M2.5 is now available across multiple inference providers, but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why It’s Best | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | E2E (s/500 tok) |
|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best value + low latency balance; cheapest output tokens; strong for cost-sensitive apps needing snappy first-token response | $0.44 | $0.27 | $0.95 | 0.56s | 66 | 38.64s |
| SiliconFlow (FP8) | Lowest blended price overall (budget-first) | $0.40 | $0.20 | $1.00 | 1.90s | 85 | 31.47s |
| Together.ai (FP4) | Lowest latency (interactive-first) | $0.53 | $0.50 | $0.55 | 0.42s | 95 | 26.80s |
| Fireworks | Very high throughput (speed-first) | $0.53 | $0.50 | $0.55 | 0.76s | 193 | 13.71s |
| Clarifai | Strong low-latency + good speed combination | $0.53 | $0.50 | $0.55 | 0.54s | 143 | 18.07s |
| SambaNova | Fastest output speed overall (throughput-max) | $0.53 | $0.50 | $0.55 | 1.60s | 395 | 7.93s |
Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale MiniMax-M2.5 deployment. It offers the best balance of low latency, competitive pricing, and full feature support. For use cases requiring maximum throughput, SambaNova leads the field. For absolute lowest latency, Together.ai is the fastest provider tested.
DeepInfra emerges as the superior choice for production workloads. While other providers may win on a single metric, it consistently scores in the top tier across every critical category without significant trade-offs.
DeepInfra uses FP8 quantization to deliver a TTFT of 0.56 seconds, within 140ms of the fastest provider (Together.ai at 0.42s) and effectively indistinguishable to human perception. Crucially, it achieves this while costing significantly less than the majority of the market ($0.44 vs $0.53 per 1M blended tokens).
Unlike the top throughput provider (SambaNova), which suffers from high latency (1.60s), DeepInfra maintains a snappy interactive feel. For developers building RAG applications or agents requiring tool use, DeepInfra’s combination of low latency, sub-$0.50 pricing, and full tool-calling support makes it the definitive all-rounder.
```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint, so the standard client works as-is.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
)
print(response.choices[0].message.content)
```

If your use case involves batch processing or generating long-form content, where the start time matters less than the completion time, SambaNova is the undisputed leader in throughput.
SambaNova’s architecture delivers 394.6 t/s, roughly 2x faster than the second-fastest provider (Fireworks) and 8x faster than the slowest (MiniMax Direct at 49 t/s). Its specialized Reconfigurable Dataflow Unit (RDU) hardware, rather than standard GPUs, allows for massive throughput at the cost of higher initial latency.
This speed comes with trade-offs. SambaNova has the smallest context window in the benchmark set (164k vs the standard ~200k) and a relatively high TTFT of 1.60s. This makes it ideal for background generation tasks but less suitable for real-time conversational interfaces.
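To take advantage of that throughput in practice, batch jobs can simply overlap many requests so the 1.60s TTFT is paid concurrently rather than serially. The sketch below assumes SambaNova exposes an OpenAI-compatible endpoint at the base URL shown and that the model id matches the one used elsewhere in this post; verify both against the provider’s docs.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Assumed OpenAI-compatible endpoint; confirm the URL and model id in SambaNova's docs.
client = OpenAI(
    api_key="YOUR_SAMBANOVA_KEY",
    base_url="https://api.sambanova.ai/v1",
)

prompts = [f"Summarize changelog entry #{i} in two sentences." for i in range(20)]

def generate(prompt: str) -> str:
    # Each request pays the ~1.6s TTFT once, then generates at ~395 t/s.
    response = client.chat.completions.create(
        model="minimax/minimax-m2.5",  # assumed id; providers name models differently
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Overlapping requests amortizes the high per-request latency across the batch.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))
```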
For real-time chat applications where perceived speed is defined by how quickly the first word appears, Together.ai takes the lead.
Utilizing aggressive FP4 quantization, Together.ai achieves the lowest latency in the field at 0.42s. However, this comes at a premium price point ($0.53) compared to budget options. Its output speed of 95 t/s is also significantly slower than the top throughput providers — it is the fastest to start, but not the fastest to finish large generations.
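Because perceived speed in chat is all about the first token, streaming is the natural consumption mode here. The following minimal sketch prints tokens as they arrive and times the first one, which is the TTFT figure the tables report; the base URL reflects Together.ai’s OpenAI-compatible API, and the model id is an assumption to adapt to the provider’s catalog.

```python
import time
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TOGETHER_KEY",
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

start = time.perf_counter()
first_token_at = None

# stream=True returns chunks as they are generated instead of one final message.
stream = client.chat.completions.create(
    model="minimax/minimax-m2.5",  # assumed id; check the provider's model catalog
    messages=[{"role": "user", "content": "Give me three taglines for a coffee shop."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some providers send usage-only chunks with no choices
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(delta, end="", flush=True)

print(f"\n\nTTFT: {first_token_at:.2f}s")
```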
For developers operating on tight margins or processing massive volumes of non-time-sensitive data, SiliconFlow offers the absolute lowest floor price.
SiliconFlow is the most affordable provider analyzed, undercutting the standard market rate by roughly 24%. However, this cost saving comes with a latency trade-off: with a TTFT of 1.90s, it has one of the slowest response times in the benchmark, more than 3x slower than DeepInfra. It is an excellent choice for offline batch jobs but is not recommended for user-facing applications.
Fireworks serves as a strong alternative for those who need high speed but cannot tolerate the high latency of SambaNova.
Fireworks holds the #2 spot for output speed (193.1 t/s) while maintaining a respectable sub-second latency (0.76s). It bridges the gap between the speed leaders and the latency leaders. However, at $0.53 per 1M tokens, it is notably more expensive than DeepInfra without offering the same low-latency benefits.
| Provider | Optimization | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Context | JSON Mode / Tool Calling |
|---|---|---|---|---|---|---|---|
| DeepInfra | Balanced (Recommended) | $0.27 | $0.95 | 0.56s | 66 | 197k | Yes / Yes |
| SambaNova | Throughput | $0.50 | $0.55 | 1.60s | 394.6 | 164k | Yes / Yes |
| Together.ai | Latency | $0.50 | $0.55 | 0.42s | 95 | 197k | Yes / Yes |
| SiliconFlow | Cost | $0.20 | $1.00 | 1.90s | 85 | 197k | Yes / Yes |
| Fireworks | Speed / Hybrid | $0.50 | $0.55 | 0.76s | 193.1 | 197k | Yes / Yes |
| Clarifai | Hybrid | $0.50 | $0.55 | 0.54s | 142.6 | 205k | Yes / Yes |
| MiniMax Direct | Native | $0.50 | $0.55 | 3.23s | 49 | 205k | No / Yes |
DeepInfra is the best choice for RAG due to its low TTFT (0.56s) and full support for JSON Mode and Function Calling, which are essential for retrieving and formatting context.
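As an illustration, JSON Mode is requested through the standard response_format parameter on OpenAI-compatible endpoints. This is a minimal sketch against DeepInfra; the answer/sources schema in the system prompt is a hypothetical example, not something the API defines.

```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# JSON Mode constrains the output to valid JSON, so the result is safe to parse.
response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[
        {"role": "system", "content": 'Reply with a JSON object: {"answer": str, "sources": [str]}.'},
        {"role": "user", "content": "Which provider has the lowest TTFT?"},
    ],
    response_format={"type": "json_object"},
)

payload = json.loads(response.choices[0].message.content)
print(payload["answer"])
```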
MiniMax-M2.5 supports function calling, but not all providers enable it. DeepInfra, Together.ai, and Fireworks support full tool use, while the MiniMax Direct API currently has limited support.
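On providers with full tool use, tool definitions follow the standard OpenAI function-calling format. Here is a minimal sketch against DeepInfra; the get_ticket_status tool is hypothetical.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# A hypothetical tool definition in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",
        "description": "Look up the status of a support ticket by id.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "What's the status of ticket TK-1042?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```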
SambaNova uses a specialized Reconfigurable Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput (394 t/s) at the cost of higher initial latency.
MiniMax-M2.5 achieves state-of-the-art performance in programming evaluations, scoring 80.2% on SWE-Bench Verified. The model was trained on over 10 languages across more than 200,000 real-world environments and excels at the entire development lifecycle, from system design to code review.
DeepInfra offers better value at $0.44/1M tokens vs Together.ai’s $0.53/1M tokens. Together.ai has slightly lower latency (0.42s vs 0.56s), but for most applications, this 140ms difference is imperceptible to users. DeepInfra is the recommended choice unless sub-half-second latency is absolutely critical.
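For readers checking the math: the blended figures in this post are reproduced exactly by a 3:1 input:output token mix, a common blending assumption, e.g. (3 × $0.27 + $0.95) / 4 = $0.44 for DeepInfra. The helper below is a minimal sketch of that arithmetic plus cost-per-completion.

```python
def blended_price(input_per_m: float, output_per_m: float, input_ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming input_ratio input tokens per output token."""
    return (input_ratio * input_per_m + output_per_m) / (input_ratio + 1)

def completion_cost(input_tokens: int, output_tokens: int,
                    input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of a single request from raw token counts."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

print(blended_price(0.27, 0.95))  # DeepInfra  -> 0.44
print(blended_price(0.20, 1.00))  # SiliconFlow -> 0.40
# A 3,000-token prompt with a 500-token answer on DeepInfra:
print(completion_cost(3_000, 500, 0.27, 0.95))  # ~ $0.0013
```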
For the vast majority of MiniMax-M2.5 implementations, DeepInfra is the logical choice. It provides the kind of low-latency experience (0.56s) usually reserved for more expensive providers while maintaining one of the lowest price points in the field ($0.44). While SambaNova is technically superior for pure bulk text generation, DeepInfra’s versatility across RAG, agents, and chat interfaces makes it the standout provider in this benchmark.