Qwen/Qwen3-Max

$1.20 / 1M input tokens · $6.00 / 1M output tokens · $0.24 / 1M cached tokens
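
To make the rates concrete, here is a back-of-the-envelope cost estimate. The token counts are illustrative, and it assumes cache hits are billed at the cached rate in place of the input rate (a common convention, stated here as an assumption):

```python
# Listed rates, in USD per 1M tokens.
RATE_IN, RATE_OUT, RATE_CACHED = 1.20, 6.00, 0.24

def cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Bill fresh input, generated output, and cache-hit input separately.
    Assumes cached tokens replace the fresh-input rate (illustrative convention)."""
    fresh = input_tokens - cached_tokens
    return (fresh * RATE_IN
            + output_tokens * RATE_OUT
            + cached_tokens * RATE_CACHED) / 1_000_000

# Example: a 50k-token prompt with 40k of it served from cache, and a 2k-token answer.
print(f"${cost_usd(50_000, 2_000, cached_tokens=40_000):.4f}")  # $0.0336
```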

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Partner · Public · 256,000-token context · JSON mode · Function calling
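
The JSON and Function tags indicate structured-output and tool-calling support. As a minimal sketch, assuming the model is served behind an OpenAI-compatible chat-completions endpoint (the base URL and environment-variable names below are placeholders, not documented provider values):

```python
import os
from openai import OpenAI

# Placeholder endpoint and credentials; substitute your provider's real values.
client = OpenAI(
    base_url=os.environ["PROVIDER_BASE_URL"],
    api_key=os.environ["PROVIDER_API_KEY"],
)

# JSON mode: constrain the reply to valid JSON.
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Max",
    messages=[{"role": "user", "content": "List three primes as a JSON array under key 'primes'."}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)

# Function calling: declare a tool schema and let the model emit a call.
# get_weather is a hypothetical tool used purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Max",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```
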
Model Information

We are thrilled to introduce Qwen3-Max, our largest and most capable model to date. It ranks among the best on the Text Arena leaderboard, surpassing GPT-5-Chat. The official release further strengthens coding and agent capabilities, achieving state-of-the-art results across a comprehensive suite of benchmarks, including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding. We invite you to try Qwen3-Max and explore its capabilities.

The Qwen3-Max model has over 1 trillion parameters and was pretrained on 36 trillion tokens. Its architecture follows the design paradigm of the Qwen3 series, incorporating our proposed global-batch load balancing loss.
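
The exact form of the global-batch load balancing loss is not spelled out here, but the idea is to aggregate expert-usage statistics over the whole global batch, rather than each micro-batch, before penalizing imbalance. A minimal PyTorch sketch, using the standard Switch-style auxiliary loss as a stand-in for the actual formulation:

```python
import torch
import torch.distributed as dist

def global_batch_lb_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-style load-balancing aux loss, with expert statistics all-reduced
    across data-parallel ranks so balance is enforced over the global batch
    instead of each micro-batch (illustrative, not Qwen's actual code).

    router_logits: [tokens, n_experts] router scores for one micro-batch.
    """
    n_experts = router_logits.size(-1)
    probs = router_logits.softmax(dim=-1)            # [T, E]
    chosen = probs.topk(top_k, dim=-1).indices       # [T, k]

    # f[e]: fraction of routed (token, slot) pairs assigned to expert e.
    f = torch.zeros(n_experts, dtype=probs.dtype, device=probs.device)
    f.scatter_add_(0, chosen.flatten(),
                   torch.ones(chosen.numel(), dtype=probs.dtype, device=probs.device))
    f = f / chosen.numel()

    # p[e]: mean router probability mass placed on expert e.
    p = probs.mean(dim=0)

    # Key step: average the statistics over all ranks (the global batch)
    # before forming the loss, rather than penalizing each micro-batch alone.
    if dist.is_initialized():
        world = dist.get_world_size()
        dist.all_reduce(f)  # default op is SUM
        dist.all_reduce(p)
        f, p = f / world, p / world

    return n_experts * torch.sum(f * p)
```

Averaging before forming the loss relaxes the balance requirement from every micro-batch to the batch as a whole: an individual micro-batch may route unevenly as long as global expert usage stays balanced.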

  • Training Stability: Thanks to the MoE (Mixture of Experts) architecture design of Qwen3, the pretraining loss curve of Qwen3-Max remained consistently smooth and stable throughout training. The entire run proceeded without any loss spikes, eliminating the need for recovery strategies such as training rollback or adjustments to the data distribution.

  • Training Efficiency: Optimized by PAI-FlashMoE's efficient multi-level pipeline parallelism strategy, Qwen3-Max-Base trained significantly more efficiently, achieving a 30% relative increase in MFU (Model FLOPs Utilization; see the sketch after this list) over Qwen2.5-Max-Base. For long-context training, our ChunkFlow strategy delivered a 3x throughput improvement over context parallelism, enabling training at a 1M-token context length. Additionally, through techniques including SanityCheck, EasyCheckpoint, and scheduling-pipeline optimizations, time lost to hardware failures on ultra-large-scale clusters was reduced to one-fifth of that observed during Qwen2.5-Max training.
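
For reference on the MFU figure, a back-of-the-envelope sketch: MFU is achieved model FLOPs per second divided by the cluster's peak FLOPs. Every input below is a made-up placeholder, not a Qwen training statistic:

```python
def mfu(tokens_per_sec: float, flops_per_token: float,
        n_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs / peak hardware FLOPs."""
    return tokens_per_sec * flops_per_token / (n_devices * peak_flops_per_device)

# Hypothetical MoE with 50B activated parameters; ~6*N FLOPs per token is the
# usual forward+backward rule of thumb for training.
flops_per_token = 6 * 50e9

# Made-up throughput on a made-up 8192-GPU cluster (989 TFLOPS = H100 BF16 dense peak).
baseline = mfu(1.0e7, flops_per_token, 8192, 989e12)
print(f"baseline MFU ≈ {baseline:.1%}")         # ≈ 37.0%
print(f"+30% relative ≈ {baseline * 1.3:.1%}")  # ≈ 48.1%
```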