Qwen/Qwen3-Max
$1.20 input / $6.00 output / $0.24 cached per 1M tokens
The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

We are thrilled to introduce Qwen3-Max — our largest and most capable model to date. It ranks among the best on the Text Arena leaderboard, surpassing GPT-5-Chat. The official release further strengthens coding and agent capabilities, achieving state-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding. We invite you to try Qwen3-Max and explore its capabilities.
The Qwen3-Max model has over 1 trillion parameters and was pretrained on 36 trillion tokens. Its architecture follows the design paradigm of the Qwen3 series, incorporating our proposed global-batch load balancing loss.
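To make the load-balancing idea concrete, below is a minimal sketch of a Switch-Transformer-style auxiliary balance loss; the "global-batch" variant differs only in that its statistics are aggregated over the full global batch (across micro-batches and data-parallel ranks) rather than each micro-batch. Function names, tensor shapes, and the exact formulation here are illustrative assumptions, not Qwen3's actual implementation.

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor,
                        expert_indices: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    """Switch-style balance loss: num_experts * sum_i f_i * P_i, where
    f_i is the fraction of tokens routed to expert i and P_i is the
    mean router probability assigned to expert i."""
    num_tokens = router_probs.shape[0]
    # f_i: fraction of tokens dispatched to each expert (non-differentiable).
    frac_tokens = torch.bincount(expert_indices, minlength=num_experts).float() / num_tokens
    # P_i: mean router probability per expert (differentiable, trains the router).
    frac_probs = router_probs.mean(dim=0)
    return num_experts * torch.sum(frac_tokens * frac_probs)

# Global-batch variant (sketch): accumulate the token counts and probability
# mass over all micro-batches and data-parallel ranks first, then form
# f_i and P_i from those aggregates, so only the global token distribution
# across experts is pushed toward uniformity.
```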
Training Stability: Thanks to the MoE (Mixture of Experts) architecture design of Qwen3, the pretraining loss curve of Qwen3-Max remains consistently smooth and stable throughout training. The entire training process proceeded seamlessly without any loss spikes, eliminating the need for strategies such as training rollback or adjustments to data distribution.
Training Efficiency: With PAI-FlashMoE’s multi-level pipeline parallelism strategy, the training efficiency of Qwen3-Max-Base improved significantly, achieving a 30% relative increase in MFU (Model FLOPs Utilization) over Qwen2.5-Max-Base. For long-context training, we further employed our ChunkFlow strategy, which delivered a 3x throughput improvement over context parallelism and enabled training with a 1M-token context length for Qwen3-Max. Additionally, through techniques including SanityCheck, EasyCheckpoint, and scheduling pipeline optimizations, the time lost to hardware failures on ultra-large-scale clusters was reduced to one-fifth of that observed during Qwen2.5-Max training.
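For reference, MFU is the ratio of achieved model FLOPs per second to the hardware's theoretical peak; a common estimate charges roughly 6 FLOPs per parameter per token for a combined forward and backward pass. The sketch below uses that rule of thumb with purely hypothetical numbers — it is not Qwen3-Max's actual cluster size, throughput, or activated parameter count.

```python
def model_flops_utilization(tokens_per_second: float,
                            active_params: float,
                            num_devices: int,
                            peak_flops_per_device: float) -> float:
    """MFU = achieved model FLOPs/s divided by the cluster's peak FLOPs/s.
    Uses the common ~6 * params FLOPs-per-token estimate for a
    forward + backward pass (attention FLOPs ignored)."""
    achieved_flops_per_second = 6.0 * active_params * tokens_per_second
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second

# Illustrative usage with made-up numbers (not the real training setup):
# mfu = model_flops_utilization(tokens_per_second=2.0e6,
#                               active_params=30e9,            # activated MoE params
#                               num_devices=4096,
#                               peak_flops_per_device=989e12)  # e.g. BF16 peak of one GPU
```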