
Qwen3.5 122B A10B is Alibaba Cloud’s mid-tier multimodal foundation model, released in February 2026. A vision-language Mixture-of-Experts model, it accepts text, image, and video inputs and is designed for native multimodal agent applications. It features 122 billion total parameters with 10 billion activated per token, through a hybrid architecture that integrates Gated Delta Networks with a sparse Mixture-of-Experts layer of 256 experts, delivering high-throughput inference with minimal latency and cost overhead.
The model supports a 262k token context window (extensible to 1M via YaRN), operates in both thinking and non-thinking modes, and offers expanded support for 201 languages and dialects. Qwen3.5 122B A10B scores 42 on the Artificial Analysis Intelligence Index — well above average among comparable models — and is released under the Apache 2.0 license, enabling commercial use and third-party hosting.
Qwen3.5 122B A10B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | Latency (TTFT) | E2E Response (s) | Function Calling | JSON Mode | Context |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | $0.94 | $0.29 | $2.90 | 155.5 | 0.59s | 16.67 / 12.86 | Yes | No | 262k |
| Alibaba Cloud | $1.10 | $0.40 | $3.20 | 137.9 | 2.32s | 20.44 / 14.50 | Yes | Yes | 262k |
| Novita | $1.10 | $0.40 | $3.20 | 83.8 | 1.72s | 31.56 / 23.87 | Yes | Yes | 262k |
| GMI (FP8) | $1.10 | $0.40 | $3.20 | 90.7 | 2.42s | 29.98 / 22.04 | Yes | Yes | 262k |
Based on benchmarks across the 4 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 122B A10B deployment. It ranks #1 across all three primary metrics (speed, latency, and cost) while pricing roughly 15% below the market rate. The only trade-off is the absence of JSON mode, which is worth noting for structured output workflows.
DeepInfra’s FP8 implementation dominates across all key performance and pricing metrics, making it the clear recommendation for the vast majority of production use cases.
At $0.94 per 1M blended tokens, DeepInfra is about 15% cheaper than every other provider in the benchmark, all of which are priced at $1.10 (equivalently, the competition costs roughly 17% more). Combined with the fastest output speed and the lowest TTFT in the field, it is the only provider that wins on all three critical dimensions simultaneously.
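The blended figures in the table are consistent with the common 3:1 input:output token weighting. That ratio is an assumption on our part (the table does not state its blend), but it reproduces both blended prices exactly, and it makes the price gap easy to recompute:

```python
def blended_price(input_per_m: float, output_per_m: float) -> float:
    """Blended $/1M tokens, assuming a 3:1 input:output token mix."""
    return (3 * input_per_m + 1 * output_per_m) / 4

deepinfra = blended_price(0.29, 2.90)  # 0.9425, rounds to $0.94
alibaba = blended_price(0.40, 3.20)    # 1.10

# How much more the $1.10 providers charge relative to DeepInfra:
premium = (alibaba - deepinfra) / deepinfra  # ~0.167, i.e. ~17% more
```

If your workload is output-heavy (long generations, short prompts), re-run the calculation with your own ratio, since the output-token gap ($2.90 vs $3.20) is smaller in relative terms than the input-token gap ($0.29 vs $0.40).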
The one trade-off worth flagging: DeepInfra (FP8) does not currently support JSON mode. Developers requiring deterministic structured outputs should either use prompt engineering to enforce JSON structure or consider Alibaba Cloud as an alternative.
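Without native JSON mode, the usual workaround is to instruct the model to reply with JSON only and then validate the reply client-side, retrying on parse failure. A minimal validation sketch (the fence-stripping logic is illustrative, not an official DeepInfra helper):

```python
import json

def extract_json(raw: str) -> dict:
    """Parse a model reply as JSON, tolerating markdown ```json fences.

    Raises json.JSONDecodeError if the reply is not valid JSON,
    which callers can catch to trigger a retry with a stricter prompt.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Take the content between the first pair of fences and
        # drop an optional "json" language tag.
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)
```

Pair this with a system prompt such as "Respond with a single JSON object and no other text," and loop with a bounded retry count when parsing fails.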
As the model’s creator, Alibaba Cloud offers a solid balance of performance and full feature support, making it the natural fallback for teams requiring JSON mode.
Alibaba Cloud delivers competitive throughput (137.9 t/s) with complete feature support, including JSON mode. Its latency (2.32s TTFT, nearly four times DeepInfra's 0.59s) makes it less suitable for real-time interactive applications. For batch workloads or structured output pipelines where JSON mode is required, it is the recommended alternative to DeepInfra.
Both Novita and GMI are priced identically at $1.10/1M blended and offer full feature support (Function Calling + JSON Mode), but neither matches DeepInfra on performance.
For developers already integrated into either ecosystem, or with specific regional availability requirements, both are viable options. However, given that DeepInfra outperforms both on every metric while costing about 15% less, neither represents the optimal choice for new deployments.
For the vast majority of Qwen3.5 122B A10B deployments, DeepInfra (FP8) is the clear recommendation. It ranks #1 in speed, latency, and cost simultaneously — a rare combination in inference provider benchmarks.
Getting Started

Getting an API Key

To use DeepInfra's services, you'll need an API key. You can get one by signing up on our platform:

1. Sign up or log in to your DeepInfra account at deepinfra.com
2. Navigate to the Dashboard and select API Keys
3. Create a new ...
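With a key in hand, requests go to DeepInfra's OpenAI-compatible chat completions endpoint. A minimal stdlib-only sketch; the endpoint path and the model identifier below are assumptions, so check the model page for the exact values:

```python
import json
import urllib.request

# Assumed values; verify both on the DeepInfra model page.
API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "Qwen/Qwen3.5-122B-A10B"

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated chat-completion request."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To send it:
# req = build_chat_request("Hello!", "YOUR_DEEPINFRA_API_KEY")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at DeepInfra instead of OpenAI.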