

MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About MiniMax-M2.5

MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through complex problems.

M2.5 was trained extensively with reinforcement learning across more than 200,000 real-world environments, covering over 10 programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby. A notable characteristic is its spec-writing tendency — before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect.

The model achieves industry-leading benchmark scores: 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. It completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1 while consuming fewer tokens. Beyond coding, M2.5 excels at office productivity tasks including generating formatted Word documents, PowerPoint presentations, and Excel spreadsheets with working formulas. The model weights are fully open-sourced on HuggingFace under a Modified MIT License.

MiniMax-M2.5 is now available across multiple inference providers, but they're not all created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

MiniMax-M2.5 API Review Summary

  • Open weights model (released February 2026) with multiple tracked API providers.
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours, using a default workload of 10,000 input tokens.
  • DeepInfra (FP8) is the standout balanced provider: #2 lowest blended price ($0.44 / 1M tokens) and #3 lowest latency (0.56s TTFT).
  • DeepInfra (FP8) leads on token pricing: #2 lowest input price ($0.27 / 1M) and #1 lowest output price ($0.95 / 1M output tokens).
  • Category leaders: SambaNova fastest output speed (394.6 t/s), Together.ai (FP4) lowest latency (0.42s), SiliconFlow (FP8) lowest blended price ($0.40).

MiniMax-M2.5 — Best APIs

| Provider | Why It’s Best | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | E2E (s/500 tok) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepInfra (FP8) | Best value + low latency balance; cheapest output tokens; strong for cost-sensitive apps needing snappy first-token response | $0.44 | $0.27 | $0.95 | 0.56s | 66 | 38.64s |
| SiliconFlow (FP8) | Lowest blended price overall (budget-first) | $0.40 | $0.20 | $1.00 | 1.90s | 85 | 31.47s |
| Together.ai (FP4) | Lowest latency (interactive-first) | $0.53 | — | — | 0.42s | 95 | 26.80s |
| Fireworks | Very high throughput (speed-first) | $0.53 | — | — | 0.76s | 193 | 13.71s |
| Clarifai | Strong low-latency + good speed combination | $0.53 | — | — | 0.54s | 143 | 18.07s |
| SambaNova | Fastest output speed overall (throughput-max) | $0.53 | — | — | 1.60s | 395 | 7.93s |
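The blended figures appear consistent with the common 3:1 input:output token mix. A quick sanity check (the 3:1 ratio is our assumption, but it reproduces the DeepInfra and SiliconFlow numbers exactly):

```python
def blended_price(input_price, output_price, input_ratio=3, output_ratio=1):
    """Blended $/1M tokens, assuming a 3:1 input:output token mix."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

# DeepInfra (FP8): $0.27 input / $0.95 output
print(round(blended_price(0.27, 0.95), 2))  # 0.44

# SiliconFlow (FP8): $0.20 input / $1.00 output
print(round(blended_price(0.20, 1.00), 2))  # 0.40
```

Other providers' blended rates may reflect a different measured mix, so treat the ratio as workload-dependent.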

Quick Verdict: Which MiniMax-M2.5 Provider is Best?

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale MiniMax-M2.5 deployment. It offers the best balance of low latency, competitive pricing, and full feature support. For use cases requiring maximum throughput, SambaNova leads the field. For absolute lowest latency, Together.ai is the fastest provider tested.

Overall Recommendation: DeepInfra (FP8)

DeepInfra emerges as the superior choice for production workloads, offering the most robust balance of low latency, competitive pricing, and feature completeness. While other providers may win in a single metric, DeepInfra consistently scores in the top tier across all critical categories without significant trade-offs.

  • Latency (TTFT): 0.56s (3rd lowest overall)
  • Output Speed: 66 t/s
  • Pricing: $0.27 Input / $0.95 Output ($0.44 Blended)
  • Context Window: 197k
  • Feature Support: Full support for both JSON Mode and Function Calling
  • Best For: RAG applications, Agentic workflows, General Chat

DeepInfra utilizes FP8 quantization to deliver a TTFT of 0.56 seconds — nearly indistinguishable from the fastest provider (Together.ai at 0.42s) for human perception. Crucially, it achieves this while being significantly cheaper ($0.44 vs $0.53 per 1M tokens) than the majority of the market.

Unlike the fastest throughput providers (SambaNova) which suffer from high latency (1.60s), DeepInfra maintains a snappy interactive feel. For developers building RAG applications or agents requiring tool use, DeepInfra’s combination of low latency, sub-$0.50 pricing, and full tool-calling support makes it the definitive all-rounder.
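TTFT is also easy to measure yourself when streaming responses. A minimal, provider-agnostic sketch (the `measure_ttft` helper is our own illustration, not part of any SDK):

```python
import time

def measure_ttft(chunks):
    """Return (seconds until first chunk, full text) for an iterable of text chunks."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # Record elapsed time when the first token arrives
            ttft = time.perf_counter() - start
        parts.append(chunk)
    return ttft, "".join(parts)
```

With an OpenAI-compatible client you would pass `stream=True` and feed each `chunk.choices[0].delta.content or ""` into the helper.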

Integration Example (Python)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai"
)

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
)

print(response.choices[0].message.content)

The Throughput Specialist: SambaNova

If your use case involves batch processing or generating long-form content where the start time matters less than the completion time, SambaNova is the undisputed leader in throughput.

  • Output Speed: 394.6 tokens/second (roughly 8x faster than the slowest provider, MiniMax Direct at 49 t/s)
  • Latency (TTFT): 1.60s
  • Price: $0.53 per 1M tokens
  • Context Window: 164k
  • Best For: Batch processing, Summarization, Offline jobs

SambaNova’s architecture delivers 394.6 t/s — roughly 2x faster than the second-fastest provider (Fireworks) and roughly 8x faster than the slowest. SambaNova uses a specialized Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput at the cost of higher initial latency.

This speed comes with trade-offs. SambaNova has the lowest context window in the benchmark set (164k vs the standard ~200k) and a relatively high TTFT of 1.60s. This makes it ideal for background generation tasks but less suitable for real-time conversational interfaces.

The Latency Leader: Together.ai (FP4)

For real-time chat applications where perceived speed is defined by how quickly the first word appears, Together.ai takes the lead.

  • Latency (TTFT): 0.42s
  • Output Speed: 95 tokens/second
  • Price: $0.53 per 1M tokens
  • Quantization: FP4
  • Context Window: 197k
  • Best For: Real-time conversational AI, Customer support bots

Utilizing aggressive FP4 quantization, Together.ai achieves the lowest latency in the field at 0.42s. However, this comes at a premium price point ($0.53) compared to budget options. Its output speed of 95 t/s is also significantly slower than the top throughput providers — it is the fastest to start, but not the fastest to finish large generations.
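When does raw throughput beat a fast first token? A back-of-envelope model (end-to-end ≈ TTFT + tokens ÷ throughput; the published E2E figures include additional overheads, so this is a simplified sketch, not the benchmark methodology):

```python
def e2e_estimate(ttft_s, speed_tps, out_tokens):
    """Simplified end-to-end time: time to first token plus generation time."""
    return ttft_s + out_tokens / speed_tps

# Under this model, the crossover between Together.ai (0.42s TTFT, 95 t/s)
# and SambaNova (1.60s TTFT, 394.6 t/s) lands at roughly 150 output tokens:
# shorter generations favor Together.ai, longer ones favor SambaNova.
for n in (100, 200, 1000):
    together = round(e2e_estimate(0.42, 95, n), 2)
    sambanova = round(e2e_estimate(1.60, 394.6, n), 2)
    print(n, together, sambanova)
```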

The Cost Efficiency Leader: SiliconFlow (FP8)

For developers operating on tight margins or processing massive volumes of non-time-sensitive data, SiliconFlow offers the absolute lowest floor price.

  • Blended Price: $0.40 per 1M tokens ($0.20 Input / $1.00 Output)
  • Latency (TTFT): 1.90s
  • Output Speed: 85 tokens/second
  • Context Window: 197k
  • Best For: Academic research, Hobbyist projects, Non-urgent data extraction

SiliconFlow is the most affordable provider analyzed, undercutting the standard market rate by roughly 24%. However, this cost saving comes with a latency trade-off. With a TTFT of 1.90s, it has one of the slowest response times in the benchmark — more than 3x slower than DeepInfra. It is an excellent choice for offline batch jobs but is not recommended for user-facing applications.
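To see what the blended-price gap means at volume, a quick back-of-envelope (the 1B-tokens-per-month workload is hypothetical):

```python
def monthly_cost(blended_per_1m, tokens_per_month):
    """Estimated monthly spend in USD from a blended $/1M-token rate."""
    return round(blended_per_1m * tokens_per_month / 1_000_000, 2)

# Hypothetical workload: 1B blended tokens per month
print(monthly_cost(0.40, 1_000_000_000))  # 400.0  (SiliconFlow)
print(monthly_cost(0.44, 1_000_000_000))  # 440.0  (DeepInfra)
print(monthly_cost(0.53, 1_000_000_000))  # 530.0  (standard market rate)
```

At that scale the SiliconFlow-vs-DeepInfra gap is about $40/month — worth weighing against the 1.34s TTFT difference.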

The High-Performance Contender: Fireworks

Fireworks serves as a strong alternative for those who need high speed but cannot tolerate the high latency of SambaNova.

  • Output Speed: 193.1 tokens/second
  • Latency (TTFT): 0.76s
  • Price: $0.53 per 1M tokens
  • Context Window: 197k

Fireworks holds the #2 spot for output speed (193.1 t/s) while maintaining a respectable sub-second latency (0.76s). It bridges the gap between the speed leaders and the latency leaders. However, at $0.53 per 1M tokens, it is notably more expensive than DeepInfra without offering the same low-latency benefits.

Comparative Specs Table

| Provider | Optimization | Input Price | Output Price | Latency (TTFT) | Speed (t/s) | Context | JSON / Tools |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepInfra | Balanced (Recommended) | $0.27 | $0.95 | 0.56s | 66 | 197k | Yes / Yes |
| SambaNova | Throughput | $0.50 | $0.55 | 1.60s | 394.6 | 164k | Yes / Yes |
| Together.ai | Latency | $0.50 | $0.55 | 0.42s | 95 | 197k | Yes / Yes |
| SiliconFlow | Cost | $0.20 | $1.00 | 1.90s | 85 | 197k | Yes / Yes |
| Fireworks | Speed / Hybrid | $0.50 | $0.55 | 0.76s | 193.1 | 197k | Yes / Yes |
| Clarifai | Hybrid | $0.50 | $0.55 | 0.54s | 142.6 | 205k | Yes / Yes |
| MiniMax Direct | Native | $0.50 | $0.55 | 3.23s | 49 | 205k | No / Yes |

Frequently Asked Questions

Which MiniMax-M2.5 provider is best for RAG?

DeepInfra is the best choice for RAG due to its low TTFT (0.56s) and full support for JSON Mode and Function Calling, which are essential for retrieving and formatting context.

Does MiniMax-M2.5 support Function Calling?

Yes, the model supports function calling, but not all providers enable it. DeepInfra, Together.ai, and Fireworks support full tool-use, while the MiniMax Direct API currently has limited support.
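With providers that support full tool use, tools are declared in the standard OpenAI function-calling format. A sketch with a hypothetical `search_docs` retrieval tool (the tool name and parameters are illustrative, not part of any real API):

```python
# Hypothetical tool schema in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",  # illustrative retrieval tool
        "description": "Search the knowledge base for relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "top_k": {"type": "integer", "description": "Number of results"},
            },
            "required": ["query"],
        },
    },
}]
```

You would pass this list via the `tools` parameter of `client.chat.completions.create(...)` on any provider with tool-calling enabled.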

Why is SambaNova so much faster?

SambaNova uses a specialized Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput (394 t/s) at the cost of higher initial latency.

Is MiniMax-M2.5 good for coding?

Yes. MiniMax-M2.5 achieves state-of-the-art performance in programming evaluations, scoring 80.2% on SWE-Bench Verified. The model was trained on over 10 languages across more than 200,000 real-world environments and excels at the entire development lifecycle — from system design to code review.

DeepInfra vs. Together.ai for MiniMax-M2.5?

DeepInfra offers better value at $0.44/1M tokens vs Together.ai’s $0.53/1M tokens. Together.ai has slightly lower latency (0.42s vs 0.56s), but for most applications, this 140ms difference is imperceptible to users. DeepInfra is the recommended choice unless sub-half-second latency is absolutely critical.

Conclusion

For the vast majority of MiniMax-M2.5 implementations, DeepInfra is the logical choice. It provides a premium low-latency experience (0.56s) usually reserved for more expensive providers, while maintaining a near-bottom-tier price point ($0.44). While SambaNova is technically superior for pure bulk text generation, DeepInfra’s versatility across RAG, agents, and chat interfaces makes it the standout provider in this benchmark.
