MiMo-V2.5 Model Documentation and Integration Guide

We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Published on 2026.07.01 by DeepInfra

MiMo-V2.5 is a native omnimodal model developed by XiaomiMiMo, designed to process and understand text, image, video, and audio through a unified architecture rather than relying on “bolted-on” components for each modality.

Built on a 310-billion-parameter Sparse Mixture of Experts (MoE) architecture — with only 15 billion parameters activated during inference — MiMo-V2.5 offers a strong balance of high-tier reasoning and computational efficiency. With a 1-million-token context window and agentic capabilities, it is engineered for complex multimodal perception, long-context reasoning, and autonomous workflows.

Architectural Capabilities

MiMo-V2.5 represents a meaningful step forward from its predecessor, MiMo-V2-Flash. By utilizing native, dedicated encoders for diverse data types, the model achieves a level of cohesion not commonly seen in large-scale models.

Key Technical Features

Native Omnimodal Encoders: Includes a 729-million-parameter Vision Transformer with hybrid window attention and a 261-million-parameter audio encoder.
Hybrid Attention Architecture: By interleaving Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio, the model reduces KV-cache storage requirements by roughly 6× without sacrificing long-context integrity.
Multi-Token Prediction (MTP): Three lightweight MTP modules (329M parameters) accelerate inference through speculative decoding and improve the efficiency of reinforcement learning.
Advanced Training: Trained on approximately 48 trillion tokens using FP8 mixed precision, the model has undergone Supervised Fine-Tuning (SFT) and Multi-Teacher On-Policy Distillation (MOPD) to perform well on agentic tasks.

Configuration Notice: Developers who downloaded the model prior to recent repository updates should re-pull the config.json and tokenizer_config.json files to ensure optimal performance and avoid degraded behavior.

Performance and Benchmarks

MiMo-V2.5 demonstrates competitive performance against frontier closed-source models, particularly in coding, temporal video reasoning, and agentic decision-making.

Agentic and Coding Performance

The model’s use of Reinforcement Learning (RL) places it near the Pareto frontier for daily agentic tasks.

Benchmark	Category	MiMo-V2.5 Score	Claude Opus 4.6	Gemini 3.1 Pro
Coding (General)	Programming/Logic	71.8	77.1	67.8
Claw-Eval Text	General Agentic	65.8	70.8	68.5
Terminal-Bench 2.0	CLI Operations	56.1	57.3	54.2

Multimodal Perception

MiMo-V2.5 shows sharp perception for temporal reasoning, matching or approaching industry leaders in video and image understanding.

Benchmark	Modality	MiMo-V2.5 Score	Gemini 3 Pro	Kimi K2.6
Image Understanding	Vision-Language	81.0	81.4	80.4
Video-MME	Video	83.5	84.2	—
MMMU-Pro	Multi-discipline	88.5	—	—
CharXiv RQ	Chart/Diagram	77.9	81.0	79.4

Long-Context Integrity

The model supports up to 1,000,000 tokens, validated through benchmarks like Graphwalks for path-finding and retrieval. A learnable attention sink bias helps reasoning accuracy remain stable even at the 1M token limit.

Getting Started with the API

MiMo-V2.5 is hosted on DeepInfra, providing high-performance, low-latency inference via an OpenAI-compatible API.

Authentication

Retrieve your API key from your DeepInfra Dashboard and include it in your HTTP headers:

Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>

API Basics

Base URL: https://api.deepinfra.com/v1/openai
Endpoint: POST /chat/completions

Implementation Examples

Using cURL

curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
      {
        "role": "user",
        "content": "Explain the advantages of a hybrid attention architecture in 2 sentences."
      }
    ]
  }'copy

Using Python

import os
import requests


url = "https://api.deepinfra.com/v1/openai/chat/completions"
api_key = os.getenv("DEEPINFRA_API_KEY")


payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "Explain the advantages of a hybrid attention architecture."}]
}


response = requests.post(url, headers={"Authorization": f"Bearer {api_key}"}, json=payload)
print(response.json())copy

Pricing and Service Tiers

Pricing is usage-based, calculated per 1 million tokens. DeepInfra offers two tiers to balance cost and priority.

Pricing Table (Per 1M Tokens)

Tier	Input Price	Output Price	Cached Input Price
Standard	$0.40	$2.00	$0.08
Priority (1.5×)	$0.60	$3.00	$0.12

Key Pricing Considerations

Cached Input Discount: Tokens successfully retrieved from the cache are billed at a significantly reduced rate ($0.08/1M tokens on Standard), making long-context conversations more cost-effective.
Priority Tier: Users requiring lower latency and prioritized processing can opt for the Priority Tier, which applies a 1.5× multiplier to all costs.
Free Tier: Refer to the DeepInfra Pricing Page for current free-tier availability and limitations.

Conclusion

XiaomiMiMo’s MiMo-V2.5 is a capable and versatile model for the next generation of AI applications. By combining a 1M token context window with native omnimodal understanding and an efficient MoE architecture, it gives developers frontier-model capabilities at a comparatively lower resource cost.

Whether you are building agentic workflows, analyzing hour-long videos, or processing large document sets, MiMo-V2.5 offers the performance and flexibility for professional-grade deployment.

Introducing GLM-5.2 on DeepInfra<p>GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks — degraded retrieval, inconsistent behavior at range. Z-AI describes this as the first time that scale has been delivered with reliability for sustained, long-horizon work. The coding […]</p>

Hosted Agents: your own always-on AI agent, from $13/monthOne click gives you a dedicated, isolated AI agent, pre-wired to fast inference and ready to work the moment it boots. No VMs, no SSH hardening, no patching. From $13/month, and idle is free.

Step 3.7 Flash is Live on DeepInfra: An Agentic, Multimodal Model Built for ProductionStepFun's Step 3.7 Flash is now live on DeepInfra. It's a 198B-parameter sparse MoE vision-language model with just ~11B active parameters per token, a 256K context window, and three selectable reasoning levels—purpose-built for high-throughput agentic workflows that combine perception, search, and reasoning.

View all