MiMo-V2.5 Model Documentation and Integration Guide
Published on 2026.07.01 by DeepInfra

MiMo-V2.5 is a native omnimodal model developed by XiaomiMiMo, designed to process and understand text, image, video, and audio through a unified architecture rather than relying on “bolted-on” components for each modality.

Built on a 310-billion-parameter sparse Mixture of Experts (MoE) architecture, of which only 15 billion parameters are activated during inference, MiMo-V2.5 pairs high-tier reasoning with a compute footprint far smaller than its total size implies. With a 1-million-token context window and agentic capabilities, it is engineered for complex multimodal perception, long-context reasoning, and autonomous workflows.
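As a quick sanity check on the efficiency claim, the share of weights active during inference follows directly from the two parameter counts quoted above. The sketch below only does that arithmetic; the constant and function names are illustrative, not part of any MiMo or DeepInfra API.

```python
# Parameter counts quoted in this guide, in billions.
TOTAL_PARAMS_B = 310
ACTIVE_PARAMS_B = 15


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters activated during inference."""
    return active_b / total_b


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Active share: {share:.1%}")  # about 4.8%
```

In other words, roughly one parameter in twenty takes part in any given forward pass, which is where the compute savings of a sparse MoE come from.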

Architectural Capabilities

MiMo-V2.5 is a meaningful step forward from its predecessor, MiMo-V2-Flash. Because it uses native, dedicated encoders for each data type, the modalities are handled within one cohesive model rather than by separate add-on components.

Key Technical Features

  • Native Omnimodal Encoders: Includes a 729-million-parameter Vision Transformer with hybrid window attention and a 261-million-parameter audio encoder.
  • Hybrid Attention Architecture: By interleaving Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio, the model reduces KV-cache storage requirements by roughly 6× without sacrificing long-context integrity.
  • Multi-Token Prediction (MTP): Three lightweight MTP modules (329M parameters) accelerate inference through speculative decoding and improve the efficiency of reinforcement learning.
  • Advanced Training: Trained on approximately 48 trillion tokens using FP8 mixed precision, the model has undergone Supervised Fine-Tuning (SFT) and Multi-Teacher On-Policy Distillation (MOPD) to perform well on agentic tasks.
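The roughly 6× KV-cache figure in the hybrid attention bullet can be reproduced with simple arithmetic. The sketch below assumes that every layer stores the same amount of KV data per token, that a Sliding Window Attention layer keeps only the most recent `window` tokens, and that a Global Attention layer keeps the whole context. The window size of 128 is a hypothetical value chosen for illustration; this guide does not state it.

```python
def kv_cache_reduction(context_len: int, window: int,
                       swa_layers: int = 5, ga_layers: int = 1) -> float:
    """Ratio of KV entries under full attention to KV entries under the hybrid layout.

    Computed for one repeating group of layers (5 SWA + 1 GA by default).
    """
    total_layers = swa_layers + ga_layers
    # Full attention: every layer caches every token in the context.
    full = total_layers * context_len
    # Hybrid: GA layers cache everything, SWA layers cache at most `window` tokens.
    hybrid = ga_layers * context_len + swa_layers * min(window, context_len)
    return full / hybrid


if __name__ == "__main__":
    WINDOW = 128  # hypothetical window size, not taken from this guide
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9} tokens: {kv_cache_reduction(n, WINDOW):.2f}x")
```

Under these assumptions the ratio approaches 6 as the context grows far beyond the window, and it drops to 1 when the context is shorter than the window, since the sliding layers then cache everything anyway.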

Configuration Notice: Developers who downloaded the model prior to recent repository updates should re-pull the config.json and tokenizer_config.json files to ensure optimal performance and avoid degraded behavior.
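One way to re-pull the two files is sketched below. It assumes the weights are hosted on the Hugging Face Hub and uses `hf_hub_download` from the `huggingface_hub` package with `force_download=True` to bypass cached copies. The notice above does not name the hosting platform, so treat both the platform and any repository ID you pass in as assumptions to verify. The `download` parameter is a hypothetical hook added here so the function can be exercised without network access.

```python
from typing import Callable, List, Optional

# The two files named in the configuration notice.
CONFIG_FILES = ("config.json", "tokenizer_config.json")


def refresh_configs(repo_id: str,
                    download: Optional[Callable[..., str]] = None) -> List[str]:
    """Re-download the config files, ignoring cached copies, and return local paths."""
    if download is None:
        # Imported lazily so this module loads even without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return [
        download(repo_id=repo_id, filename=name, force_download=True)
        for name in CONFIG_FILES
    ]
```

A call such as `refresh_configs("XiaomiMiMo/MiMo-V2.5")` would fetch both files, assuming the repository ID matches the model identifier used in the API examples.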

Performance and Benchmarks

MiMo-V2.5 demonstrates competitive performance against frontier closed-source models, particularly in coding, temporal video reasoning, and agentic decision-making.

Agentic and Coding Performance

The model’s use of Reinforcement Learning (RL) places it near the Pareto frontier for daily agentic tasks.

| Benchmark | Category | MiMo-V2.5 Score | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Coding (General) | Programming/Logic | 71.8 | 77.1 | 67.8 |
| Claw-Eval Text | General Agentic | 65.8 | 70.8 | 68.5 |
| Terminal-Bench 2.0 | CLI Operations | 56.1 | 57.3 | 54.2 |

Multimodal Perception

MiMo-V2.5 performs well on perception tasks, including temporal reasoning over video, and matches or approaches industry leaders in video and image understanding.

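| Benchmark | Modality | MiMo-V2.5 Score | Gemini 3 Pro | Kimi K2.6 |
| --- | --- | --- | --- | --- |
| Image Understanding | Vision-Language | 81.0 | 81.4 | 80.4 |
| Video-MME | Video | 83.5 | 84.2 | |
| MMMU-Pro | Multi-discipline | 88.5 | | |
| CharXiv RQ | Chart/Diagram | 77.9 | 81.0 | 79.4 |

Note: the Video-MME and MMMU-Pro rows list fewer scores than there are columns. The values are shown left to right, so their column assignment is uncertain.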

Long-Context Integrity

The model supports up to 1,000,000 tokens, validated through benchmarks like Graphwalks for path-finding and retrieval. A learnable attention sink bias helps reasoning accuracy remain stable even at the 1M token limit.

Getting Started with the API

MiMo-V2.5 is hosted on DeepInfra, providing high-performance, low-latency inference via an OpenAI-compatible API.

Authentication

Retrieve your API key from your DeepInfra Dashboard and include it in your HTTP headers:

Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>

API Basics

  • Base URL: https://api.deepinfra.com/v1/openai
  • Endpoint: POST /chat/completions
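The base URL, endpoint, and authorization header combine into a single HTTP request. The following sketch assembles that request with Python's standard library only and does not send it, which makes the pieces easy to inspect. The helper name `build_chat_request` is illustrative and not part of any DeepInfra SDK.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/openai"
CHAT_ENDPOINT = "/chat/completions"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble (but do not send) a POST request for the chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + CHAT_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("DEEPINFRA_API_KEY", "")
    req = build_chat_request(key, "XiaomiMiMo/MiMo-V2.5", "Hello")
    print(req.get_method(), req.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen(req, timeout=60)` and decode the JSON body of the response.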

Implementation Examples

Using cURL

curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
      {
        "role": "user",
        "content": "Explain the advantages of a hybrid attention architecture in 2 sentences."
      }
    ]
  }'

Using Python

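import os
import requests

# OpenAI-compatible chat completions endpoint on DeepInfra
url = "https://api.deepinfra.com/v1/openai/chat/completions"
api_key = os.getenv("DEEPINFRA_API_KEY")  # export this variable before running

payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "Explain the advantages of a hybrid attention architecture."}],
}

# A timeout keeps the script from hanging if the service is unreachable
response = requests.post(url, headers={"Authorization": f"Bearer {api_key}"}, json=payload, timeout=60)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())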
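Because MiMo-V2.5 is described as omnimodal, a request that pairs text with an image is a natural next step. The sketch below builds such a payload using the content-parts convention of the OpenAI chat format (`text` and `image_url` parts, with the image inlined as a base64 data URL). Whether the DeepInfra endpoint accepts image parts for this model is an assumption not confirmed by this guide, and the helper names are illustrative.

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build one user message that pairs a text prompt with an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumed format: OpenAI-style image part carrying a data URL.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def build_image_payload(prompt: str, image_bytes: bytes,
                        model: str = "XiaomiMiMo/MiMo-V2.5") -> dict:
    """Wrap the multimodal message in a chat completions request body."""
    return {"model": model, "messages": [build_image_message(prompt, image_bytes)]}
```

The resulting dictionary can be passed as the `json=` argument of the `requests.post` call shown above, for example after reading the image with `open("chart.png", "rb").read()` (a hypothetical file name).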

Pricing and Service Tiers

Pricing is usage-based, calculated per 1 million tokens. DeepInfra offers two tiers to balance cost and priority.

Pricing Table (Per 1M Tokens)

| Tier | Input Price | Output Price | Cached Input Price |
| --- | --- | --- | --- |
| Standard | $0.40 | $2.00 | $0.08 |
| Priority (1.5×) | $0.60 | $3.00 | $0.12 |

Key Pricing Considerations

  • Cached Input Discount: Tokens successfully retrieved from the cache are billed at a significantly reduced rate ($0.08/1M tokens on Standard), making long-context conversations more cost-effective.
  • Priority Tier: Users requiring lower latency and prioritized processing can opt for the Priority Tier, which applies a 1.5× multiplier to all costs.
  • Free Tier: Refer to the DeepInfra Pricing Page for current free-tier availability and limitations.
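To see how these rates combine, the sketch below estimates the cost of a single request from the table above. It assumes cached input tokens are billed at the cached rate and all remaining input tokens at the regular input rate. The function and dictionary names are illustrative, and actual invoices may be computed differently.

```python
# Prices in USD per 1 million tokens, copied from the pricing table.
PRICES_PER_M = {
    "standard": {"input": 0.40, "output": 2.00, "cached_input": 0.08},
    "priority": {"input": 0.60, "output": 3.00, "cached_input": 0.12},
}


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0, tier: str = "standard") -> float:
    """Estimate request cost in USD; `input_tokens` counts uncached input only."""
    price = PRICES_PER_M[tier]
    total = (
        input_tokens * price["input"]
        + cached_input_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 200k fresh input tokens, 800k served from cache, 10k output tokens.
    for tier in ("standard", "priority"):
        cost = estimate_cost(200_000, 10_000, cached_input_tokens=800_000, tier=tier)
        print(f"{tier}: ${cost:.3f}")
```

For this example the Standard estimate is $0.164 and the Priority estimate is $0.246, exactly 1.5 times higher. Had all 1,000,000 input tokens been uncached, the Standard input charge alone would be $0.40.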

Conclusion

XiaomiMiMo’s MiMo-V2.5 is a capable and versatile model for the next generation of AI applications. By combining a 1M token context window with native omnimodal understanding and an efficient MoE architecture, it gives developers frontier-model capabilities at a comparatively lower resource cost.

Whether you are building agentic workflows, analyzing hour-long videos, or processing large document sets, MiMo-V2.5 offers the performance and flexibility for professional-grade deployment.
