Qwen3.5 35B A3B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 35B A3B

Qwen3.5 35B A3B is a native vision-language model released by Alibaba Cloud in February 2026. It uses a hybrid architecture that combines Gated Delta Networks with a sparse Mixture-of-Experts (MoE) design to improve inference efficiency. The model has 35 billion total parameters but activates only about 3 billion per token: of its 256 experts, each token uses 8 routed experts plus 1 shared expert. It outperforms previous-generation models more than 6x its size.

The model supports a 262k token context window (extensible to 1M via YaRN), dual thinking and non-thinking modes, tool calling, and 201 languages and dialects. Qwen3.5-Flash is the hosted API version corresponding to Qwen3.5-35B-A3B, offering additional production features including 1M context length by default and official built-in tools. The model is released under the Apache 2.0 license.

Qwen3.5 35B A3B is offered by multiple providers, and they differ in latency, throughput, and price. This analysis compares them on each metric so you can pick the best fit for your use case.

Qwen3.5 35B A3B (Reasoning) API Review Summary

  • DeepInfra (FP8) delivers the fastest initial response: 0.60s TTFT — the lowest latency among all tracked providers.
  • DeepInfra (FP8) achieves the best end-to-end response time: 14.86s for a 500-token output — faster than all competitors.
  • GMI (FP8) leads in raw throughput: 190 t/s — the highest sustained generation speed in the benchmark.
  • All providers offer competitive cost structures: under $0.75 per million blended tokens across the board.
  • Novita, GMI (FP8), and Alibaba Cloud share the lowest blended price: $0.69/1M.

Qwen3.5 35B A3B — Best APIs

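| Provider | Blended ($/1M) | Speed (t/s) | Latency (TTFT) | E2E 500 Tokens (s) |
|---|---|---|---|---|
| DeepInfra (FP8) | $0.71 | 175 | 0.60s | 14.86s |
| Novita | $0.69 | 184 | 1.73s | 15.35s |
| GMI (FP8) | $0.69 | 190 | 2.39s | 15.57s |
| Alibaba Cloud | $0.69 | 162 | 2.04s | 17.48s |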
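
The end-to-end figures in the table are much larger than TTFT plus 500 tokens at the listed speed, which suggests the measurement also covers reasoning tokens generated before the answer. The sketch below is a rough check, not a stated methodology: it assumes end-to-end time is approximately TTFT plus total generated tokens divided by output speed, and backs out the implied token count for each row. The function name `implied_tokens` is illustrative.

```python
# Figures copied from the table: (speed in t/s, TTFT in s, E2E in s).
PROVIDERS = {
    "DeepInfra (FP8)": (175, 0.60, 14.86),
    "Novita": (184, 1.73, 15.35),
    "GMI (FP8)": (190, 2.39, 15.57),
    "Alibaba Cloud": (162, 2.04, 17.48),
}


def implied_tokens(speed_tps, ttft_s, e2e_s):
    """Tokens generated if E2E = TTFT + tokens / speed (assumed model)."""
    return (e2e_s - ttft_s) * speed_tps


for name, (speed, ttft, e2e) in PROVIDERS.items():
    print(f"{name}: ~{implied_tokens(speed, ttft, e2e):.0f} tokens")
```

Under this assumption every row lands close to 2,500 tokens, which would be consistent with roughly 2,000 reasoning tokens preceding the 500-token answer. That reading is an inference from the numbers, not something the benchmark states.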

Quick Verdict: Which Qwen3.5 35B A3B Provider is Best?

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 35B A3B deployment. Its unmatched TTFT of 0.60s and best-in-class end-to-end response time (14.86s) make it the top choice for interactive and user-facing applications. For workloads prioritising maximum raw throughput, GMI (FP8) leads at 190 t/s. For the most cost-sensitive deployments, Novita, GMI, and Alibaba Cloud are all tied at $0.69/1M blended.

Overall Winner: DeepInfra (FP8)

DeepInfra stands out as the overall recommended provider, delivering the lowest latency and the fastest end-to-end response time among all evaluated options.

  • Blended Price: $0.71 / 1M tokens
  • Output Speed: 175 t/s
  • Latency (TTFT): 0.60s (fastest in the benchmark)
  • End-to-End (500 tokens): 14.86s (fastest in the benchmark)

While its token generation pace remains highly competitive at 175 t/s, it is the exceptionally brief TTFT of 0.60s that makes DeepInfra the superior choice for interactive applications such as real-time assistants, conversational AI, and coding tools. Its blended price of $0.71/1M is marginally above the $0.69 floor, but the performance advantage more than justifies the difference for user-facing workloads.
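
To put the price gap in concrete terms, here is a small worked calculation. The two prices come from the benchmark; the monthly token volume is a hypothetical figure chosen only for illustration, and `token_cost` is an illustrative helper.

```python
def token_cost(tokens, price_per_million):
    """Cost in dollars for a token volume at a blended $/1M price."""
    return tokens / 1_000_000 * price_per_million


DEEPINFRA_PRICE = 0.71  # $/1M blended, from the benchmark
FLOOR_PRICE = 0.69      # $/1M blended, lowest in the benchmark

monthly_tokens = 1_000_000_000  # hypothetical volume, for illustration only

extra = (token_cost(monthly_tokens, DEEPINFRA_PRICE)
         - token_cost(monthly_tokens, FLOOR_PRICE))
premium_pct = (DEEPINFRA_PRICE - FLOOR_PRICE) / FLOOR_PRICE * 100

print(f"Extra cost: ${extra:.2f} per month ({premium_pct:.1f}% premium)")
```

At that assumed volume the difference is about $20 per month, a premium of a little under 3%.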

Best for Throughput: GMI (FP8)

For workloads where the primary requirement is the highest volume of tokens generated per second, GMI (FP8) leads the benchmark.

  • Blended Price: $0.69 / 1M tokens
  • Output Speed: 190 t/s (fastest in the benchmark)
  • Latency (TTFT): 2.39s
  • End-to-End (500 tokens): 15.57s

GMI’s 190 t/s throughput is the highest measured, making it the natural choice for batch processing, offline data generation, or summarization tasks where the initial latency is not a critical constraint. Its TTFT of 2.39s, however, makes it less suitable for real-time user-facing applications where perceived responsiveness matters.
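
The trade-off between a fast start and a fast stream can be estimated per request. The sketch below assumes a simple model in which request time equals TTFT plus generated tokens divided by output speed; it considers a single request only and ignores concurrency, queuing, and rate limits. The helper names are illustrative.

```python
def e2e_seconds(tokens, ttft_s, speed_tps):
    """Assumed model: time = TTFT + generated tokens / output speed."""
    return ttft_s + tokens / speed_tps


def crossover_tokens(fast_start, fast_stream):
    """Token count at which the higher-throughput provider catches up.

    Each argument is a (ttft_s, speed_tps) pair. fast_stream must have
    both the higher TTFT and the higher output speed.
    """
    (ttft_a, speed_a), (ttft_b, speed_b) = fast_start, fast_stream
    return (ttft_b - ttft_a) / (1 / speed_a - 1 / speed_b)


deepinfra = (0.60, 175)  # (TTFT in s, speed in t/s) from the benchmark
gmi = (2.39, 190)

n = crossover_tokens(deepinfra, gmi)
print(f"GMI finishes first above ~{n:.0f} generated tokens per request")
```

Under this model the break-even point is a little under 4,000 generated tokens per request: shorter generations finish sooner on DeepInfra, while longer ones favor GMI.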

Best Balanced Option: Novita

Novita offers a compelling middle ground between generation speed and initial responsiveness, making it a versatile option for mixed workloads.

  • Blended Price: $0.69 / 1M tokens
  • Output Speed: 184 t/s
  • Latency (TTFT): 1.73s
  • End-to-End (500 tokens): 15.35s

Novita ranks second in throughput (184 t/s), TTFT (1.73s), and end-to-end response time (15.35s). Combined with a blended price that ties for the lowest in the benchmark ($0.69/1M), it is an excellent choice for developers who need a middle ground between DeepInfra’s interactivity and GMI’s raw throughput without paying a premium.

First-Party Provider: Alibaba Cloud

As the model creator, Alibaba Cloud offers a reliable first-party baseline with full production support via the Qwen3.5-Flash hosted API.

  • Blended Price: $0.69 / 1M tokens
  • Output Speed: 162 t/s
  • Latency (TTFT): 2.04s
  • End-to-End (500 tokens): 17.48s

Alibaba Cloud’s pricing matches the market floor at $0.69/1M, but its throughput (162 t/s, lowest in the benchmark) and end-to-end response time (17.48s, slowest) make it the least performant option for pure inference speed. It remains the natural fallback for teams already in the Alibaba Cloud ecosystem or needing the extended production features of Qwen3.5-Flash.

Conclusion

Selecting the right provider for Qwen3.5 35B A3B comes down to your application’s primary bottleneck. For interactive, user-facing applications where responsiveness matters most, DeepInfra’s TTFT (0.60s) and end-to-end time (14.86s) are the best in the benchmark. For batch workloads requiring maximum sustained throughput, GMI (FP8) leads at 190 t/s. For teams needing a cost-efficient balance of both, Novita delivers second-place results on every speed metric at the benchmark’s lowest blended price ($0.69/1M, shared with GMI and Alibaba Cloud).

  • Choose DeepInfra (FP8) for the best overall responsiveness — lowest latency and fastest end-to-end response times.
  • Choose GMI (FP8) for maximum raw throughput in batch processing or offline workloads.
  • Choose Novita for the best balance of speed, latency, and cost at $0.69/1M.
  • Choose Alibaba Cloud for first-party support or access to extended production features via Qwen3.5-Flash.
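
The recommendations above can be reproduced mechanically from the benchmark figures. The following is a minimal sketch: the `Provider` class, the `PRIORITIES` mapping, and `best_for` are illustrative names, and the function returns a list so that ties (such as the shared $0.69 price) stay visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    price: float  # blended $/1M tokens
    speed: float  # output tokens per second
    ttft: float   # seconds to first token
    e2e: float    # seconds, end-to-end for 500 output tokens


# Figures copied from the benchmark table.
PROVIDERS = [
    Provider("DeepInfra (FP8)", 0.71, 175, 0.60, 14.86),
    Provider("Novita", 0.69, 184, 1.73, 15.35),
    Provider("GMI (FP8)", 0.69, 190, 2.39, 15.57),
    Provider("Alibaba Cloud", 0.69, 162, 2.04, 17.48),
]

# Priority name -> (attribute to compare, whether higher is better).
PRIORITIES = {
    "latency": ("ttft", False),
    "e2e": ("e2e", False),
    "throughput": ("speed", True),
    "cost": ("price", False),
}


def best_for(priority):
    """Return the names of all providers tied for best on one metric."""
    attr, higher_is_better = PRIORITIES[priority]
    pick = max if higher_is_better else min
    best = pick(getattr(p, attr) for p in PROVIDERS)
    return [p.name for p in PROVIDERS if getattr(p, attr) == best]


for priority in PRIORITIES:
    print(f"{priority}: {', '.join(best_for(priority))}")
```

A single-metric lookup like this does not capture the "balanced" case for Novita or the first-party support argument for Alibaba Cloud, which depend on weighing several factors together.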