
MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through complex problems.
M2.5 was trained extensively with reinforcement learning across more than 200,000 real-world environments, covering over 10 programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby. A notable characteristic is its spec-writing tendency — before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect.
The model achieves industry-leading benchmark scores: 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. It completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1 while consuming fewer tokens. Beyond coding, M2.5 excels at office productivity tasks including generating formatted Word documents, PowerPoint presentations, and Excel spreadsheets with working formulas. The model weights are fully open-sourced on HuggingFace under a Modified MIT License.
MiniMax-M2.5 is now available across multiple inference providers, but not all of them are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why It’s Best | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | E2E (s/500 tok) |
|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best value + low latency balance; cheapest output tokens; strong for cost-sensitive apps needing snappy first-token response | $0.44 | $0.27 | $0.95 | 0.56s | 66 | 38.64s |
| SiliconFlow (FP8) | Lowest blended price overall (budget-first) | $0.40 | $0.20 | $1.00 | 1.90s | 85 | 31.47s |
| Together.ai (FP4) | Lowest latency (interactive-first) | $0.53 | — | — | 0.42s | 95 | 26.80s |
| Fireworks | Very high throughput (speed-first) | $0.53 | — | — | 0.76s | 193 | 13.71s |
| Clarifai | Strong low-latency + good speed combination | $0.53 | — | — | 0.54s | 143 | 18.07s |
| SambaNova | Fastest output speed overall (throughput-max) | $0.53 | — | — | 1.60s | 395 | 7.93s |
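The "Blended ($/1M)" column for the providers that publish split pricing lines up with a 3:1 input:output token weighting, a common convention in pricing benchmarks. A minimal sketch of that calculation (the 3:1 ratio is inferred from the DeepInfra and SiliconFlow rows above, not stated by any provider):

```python
# Blended price per 1M tokens at a 3:1 input:output weighting.
# The ratio is an inferred assumption, not an official formula.
def blended_price(input_per_m: float, output_per_m: float, in_out_ratio: float = 3.0) -> float:
    return (in_out_ratio * input_per_m + output_per_m) / (in_out_ratio + 1)

print(f"{blended_price(0.27, 0.95):.2f}")  # DeepInfra   -> 0.44
print(f"{blended_price(0.20, 1.00):.2f}")  # SiliconFlow -> 0.40
```

Both results match the table, which is why the 3:1 weighting is a reasonable working assumption when comparing providers that only publish split input/output rates.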
Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale MiniMax-M2.5 deployment. It offers the best balance of low latency, competitive pricing, and full feature support. For use cases requiring maximum throughput, SambaNova leads the field. For absolute lowest latency, Together.ai is the fastest provider tested.
While other providers may win on a single metric, DeepInfra consistently scores in the top tier across every critical category (latency, pricing, and feature completeness) without significant trade-offs, which makes it the strongest fit for production workloads.
DeepInfra utilizes FP8 quantization to deliver a TTFT of 0.56 seconds — nearly indistinguishable from the fastest provider (Together.ai at 0.42s) for human perception. Crucially, it achieves this while being significantly cheaper ($0.44 vs $0.53 per 1M tokens) than the majority of the market.
Unlike the fastest throughput providers (SambaNova) which suffer from high latency (1.60s), DeepInfra maintains a snappy interactive feel. For developers building RAG applications or agents requiring tool use, DeepInfra’s combination of low latency, sub-$0.50 pricing, and full tool-calling support makes it the definitive all-rounder.
```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint, so the standard client works.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
)
print(response.choices[0].message.content)
```

If your use case involves batch processing or generating long-form content where the start time matters less than the completion time, SambaNova is the undisputed leader in throughput.
SambaNova’s architecture delivers 394.6 t/s — roughly 2x faster than the second-fastest provider (Fireworks at 193.1 t/s) and about 8x faster than the slowest endpoint tracked (MiniMax Direct at 49 t/s). SambaNova uses a specialized Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput at the cost of higher initial latency.
This speed comes with trade-offs. SambaNova has the lowest context window in the benchmark set (164k vs the standard ~200k) and a relatively high TTFT of 1.60s. This makes it ideal for background generation tasks but less suitable for real-time conversational interfaces.
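The E2E column in the benchmark can be reconstructed from TTFT and throughput: the published figures match TTFT plus roughly 2,500 generated tokens divided by speed — about 5x the 500 visible tokens, consistent with M2.5 emitting chain-of-thought reasoning tokens before the answer. A sketch of that model (the 2,500-token figure is back-solved from the published numbers, not documented anywhere):

```python
# End-to-end response time: time-to-first-token plus generation time.
# N_GENERATED is inferred from the benchmark's E2E column; it exceeds the
# 500 visible tokens because reasoning tokens are generated (and timed) too.
N_GENERATED = 2500

def e2e_seconds(ttft_s: float, tokens_per_s: float, n_tokens: int = N_GENERATED) -> float:
    return ttft_s + n_tokens / tokens_per_s

print(f"{e2e_seconds(0.76, 193.1):.2f}")  # Fireworks -> 13.71 (matches table)
print(f"{e2e_seconds(0.54, 142.6):.2f}")  # Clarifai  -> 18.07 (matches table)
```

Under this model, SambaNova's 1.60s TTFT is amortized quickly: past roughly 80 generated tokens it already finishes before DeepInfra does, which is why it dominates long batch generations despite the slow start.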
For real-time chat applications where perceived speed is defined by how quickly the first word appears, Together.ai takes the lead.
Utilizing aggressive FP4 quantization, Together.ai achieves the lowest latency in the field at 0.42s. However, this comes at a premium price point ($0.53) compared to budget options. Its output speed of 95 t/s is also significantly slower than the top throughput providers — it is the fastest to start, but not the fastest to finish large generations.
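Perceived latency in a chat UI is the time until the first streamed token renders, which you can measure client-side. A runnable sketch with a stand-in stream — with a real OpenAI-compatible client, the iterable would be the text deltas from a `stream=True` completion:

```python
import time

def first_token_latency(deltas, clock=time.monotonic):
    """Seconds from request start until the first non-empty text delta arrives."""
    start = clock()
    for delta in deltas:
        if delta:  # streamed chunks can carry empty/None deltas (e.g. role-only)
            return clock() - start
    return None  # stream ended without producing any text

# Offline demo: an in-memory stream arrives effectively instantly.
print(first_token_latency(["Hel", "lo"]) < 0.1)  # -> True
```

Measuring this against your own prompts and region is worthwhile: published TTFT figures are averages, and the 0.42s vs 0.56s gap discussed here can shift with payload size and network distance.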
For developers operating on tight margins or processing massive volumes of non-time-sensitive data, SiliconFlow offers the absolute lowest floor price.
SiliconFlow is the most affordable provider analyzed, undercutting the standard market rate by roughly 24%. However, this cost saving comes with a latency trade-off. With a TTFT of 1.90s, it has one of the slowest response times in the benchmark — more than 3x slower than DeepInfra’s 0.56s. It is an excellent choice for offline batch jobs but is not recommended for user-facing applications.
Fireworks serves as a strong alternative for those who need high speed but cannot tolerate the high latency of SambaNova.
Fireworks holds the #2 spot for output speed (193.1 t/s) while maintaining a respectable sub-second latency (0.76s). It bridges the gap between the speed leaders and the latency leaders. However, at $0.53 per 1M tokens, it is notably more expensive than DeepInfra without offering the same low-latency benefits.
| Provider | Optimization | Input Price | Output Price | Latency (TTFT) | Speed (t/s) | Context | JSON / Tools |
|---|---|---|---|---|---|---|---|
| DeepInfra | Balanced (Recommended) | $0.27 | $0.95 | 0.56s | 66 | 197k | Yes / Yes |
| SambaNova | Throughput | $0.50 | $0.55 | 1.60s | 394.6 | 164k | Yes / Yes |
| Together.ai | Latency | $0.50 | $0.55 | 0.42s | 95 | 197k | Yes / Yes |
| SiliconFlow | Cost | $0.20 | $1.00 | 1.90s | 85 | 197k | Yes / Yes |
| Fireworks | Speed / Hybrid | $0.50 | $0.55 | 0.76s | 193.1 | 197k | Yes / Yes |
| Clarifai | Hybrid | $0.50 | $0.55 | 0.54s | 142.6 | 205k | Yes / Yes |
| MiniMax Direct | Native | $0.50 | $0.55 | 3.23s | 49 | 205k | No / Yes |
DeepInfra is the best choice for RAG due to its low TTFT (0.56s) and full support for JSON Mode and Function Calling, which are essential for retrieving and formatting context.
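A sketch of how JSON Mode fits a RAG pipeline, assuming an OpenAI-compatible endpoint with the standard `response_format` switch; the `answer`/`sources` schema and the model slug are illustrative, not prescribed by any provider:

```python
import json

# Request fragment: constrain the model to emit valid JSON (hypothetical schema).
request_kwargs = {
    "model": "minimax/minimax-m2.5",  # slug may vary per provider
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system", "content": 'Answer as JSON: {"answer": ..., "sources": [...]}'},
        {"role": "user", "content": "..."},
    ],
}

def validate_rag_answer(raw: str) -> dict:
    """Parse and sanity-check the model's JSON before using it downstream."""
    data = json.loads(raw)  # raises on malformed JSON
    if not {"answer", "sources"} <= data.keys():
        raise ValueError("missing required keys")
    return data

print(validate_rag_answer('{"answer": "42", "sources": ["doc1"]}'))
```

Even with JSON Mode enabled, validating the payload before use is cheap insurance, since the mode guarantees syntactic JSON but not your schema.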
Yes, the model supports function calling, but not all providers enable it. DeepInfra, Together.ai, and Fireworks support full tool-use, while the MiniMax Direct API currently has limited support.
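For providers that enable tool use, the OpenAI-style `tools` schema is the common interface. A hedged sketch — the function name and parameters below are illustrative, not part of any provider's API — plus the argument-parsing step the model's response requires:

```python
import json

# Illustrative tool definition in the OpenAI-compatible format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def parse_tool_args(raw_arguments: str) -> dict:
    """Tool-call arguments arrive as a JSON string; decode before dispatching."""
    return json.loads(raw_arguments)

print(parse_tool_args('{"city": "Berlin"}'))  # -> {'city': 'Berlin'}
```

The `tools` list is passed alongside `messages` in the chat-completion request; the model's reply then carries zero or more tool calls whose `arguments` field is decoded as shown.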
SambaNova uses a specialized Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput (394 t/s) at the cost of higher initial latency.
Yes. MiniMax-M2.5 achieves state-of-the-art performance in programming evaluations, scoring 80.2% on SWE-Bench Verified. The model was trained on over 10 languages across more than 200,000 real-world environments and excels at the entire development lifecycle — from system design to code review.
DeepInfra offers better value at $0.44/1M tokens vs Together.ai’s $0.53/1M tokens. Together.ai has slightly lower latency (0.42s vs 0.56s), but for most applications, this 140ms difference is imperceptible to users. DeepInfra is the recommended choice unless sub-half-second latency is absolutely critical.
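What that $0.09/1M blended-rate gap means at volume can be made concrete (the monthly volumes below are hypothetical; the rates are the ones quoted above):

```python
# Cost difference between the quoted blended rates at sample monthly volumes.
def monthly_cost_usd(blended_per_m: float, million_tokens: float) -> float:
    return blended_per_m * million_tokens

for volume_m in (100, 1_000):  # 100M and 1B tokens per month (hypothetical)
    saving = monthly_cost_usd(0.53, volume_m) - monthly_cost_usd(0.44, volume_m)
    print(f"{volume_m}M tokens/month: ${saving:.2f} saved")
```

At low volumes the difference is noise; at billions of tokens per month it becomes a line item worth choosing a provider over.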
For the vast majority of MiniMax-M2.5 implementations, DeepInfra is the logical choice. It provides a premium low-latency experience (0.56s) usually reserved for more expensive providers, while maintaining a near-bottom-tier price point ($0.44). While SambaNova is technically superior for pure bulk text generation, DeepInfra’s versatility across RAG, agents, and chat interfaces makes it the standout provider in this benchmark.
© 2026 Deep Infra. All rights reserved.