Qwen3.5 27B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 27B (Reasoning)

Qwen3.5 27B is part of Alibaba Cloud’s latest-generation foundation model family, released in February 2026. Unlike the Mixture-of-Experts variants in the Qwen3.5 series, the 27B model uses a dense architecture combining Gated Delta Networks and Feed Forward Networks. It achieves strong benchmark scores including MMLU-Pro (86.1%), GPQA Diamond (85.5%), and SWE-bench Verified (72.4%).

The model features a 262,144-token native context window (extensible to 1M via YaRN), support for 201 languages, both thinking and non-thinking modes, tool calling, and multimodal input processing through early fusion training. It is released under the Apache 2.0 license, enabling commercial use and third-party hosting.

Qwen3.5 27B is now available across multiple inference providers, but not all of them are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 27B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the performance leader: #1 output speed (153.3 t/s) and #1 lowest latency (0.91s TTFT) among all 4 benchmarked providers.
  • DeepInfra is 2.6x faster than the slowest provider: output speed ranges from 153.3 t/s (DeepInfra) to 57.9 t/s (GMI).
  • DeepInfra has the lowest input token price: $0.26 / 1M input tokens (vs $0.30 for Alibaba Cloud and Novita).
  • DeepInfra is near price-parity on blended cost: $0.84/1M vs $0.82/1M for the three lower-cost providers — a 2.4% premium for significantly better performance.
  • All 4 providers support function calling; JSON mode is supported by 3 of 4 — DeepInfra does not currently support JSON mode.
  • Best end-to-end response time: DeepInfra at 17.22s / 13.04s for a 500-token output including reasoning time.

Qwen3.5 27B (Reasoning) — Best APIs

| Provider | Speed (t/s) | Latency (TTFT) | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Context | Func | JSON | Positioning |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | 153.3 | 0.91s | $0.84 | $0.26 | n/a | 262k | Yes | No | Best overall performance: fastest speed + lowest latency |
| Alibaba Cloud | 86.3 | 5.57s | $0.82 | $0.30 | $2.40 | 262k | Yes | Yes | Lowest blended price (tied); slower speed and higher TTFT |
| Novita | 66.7 | 5.08s | $0.82 | $0.30 | $2.40 | 262k | Yes | Yes | Low blended price (tied); mid-tier speed; ~5s TTFT |
| GMI (FP8) | 57.9 | 5.51s | $0.82 | n/a | $2.40 | 262k | Yes | Yes | Low blended price (tied); slowest output speed in benchmark |

Quick Verdict: Which Qwen3.5 27B Provider is Best?

Based on benchmarks across 4 tracked providers, DeepInfra (FP8) is the recommended API for production-scale Qwen3.5 27B deployment. It delivers the fastest output speed (153.3 t/s), the lowest latency (0.91s TTFT), and the lowest input token price ($0.26/1M) — all at a blended cost of just $0.84/1M, only 2.4% above the market floor. For teams requiring JSON mode, Alibaba Cloud or Novita are the recommended alternatives at identical pricing.
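The blended figures above can be reproduced from the per-token prices. Benchmark aggregators commonly weight input and output tokens at a 3:1 ratio; that ratio is an assumption here (the source does not state it), but it matches the $0.82/1M figure for the providers with published input and output prices:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: int = 3, output_ratio: int = 1) -> float:
    """Weighted average of input/output prices per 1M tokens,
    assuming a 3:1 input:output token mix."""
    total = input_ratio + output_ratio
    return (input_ratio * input_per_m + output_ratio * output_per_m) / total

# Alibaba Cloud / Novita: $0.30 input, $2.40 output
blended = blended_price(0.30, 2.40)  # 0.825, i.e. ~$0.82/1M as in the table
```

Swap in different ratios to model your own workload; an input-heavy RAG pipeline, for example, would weight input tokens far more heavily and favor DeepInfra's lower $0.26/1M input price.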

Overall Winner: DeepInfra (FP8)

DeepInfra is the clear performance leader for Qwen3.5 27B, delivering industry-leading speed and latency at a near-floor price point.

  • Output Speed: 153.3 t/s (#1 — 2.6x faster than GMI)
  • Latency (TTFT): 0.91s (#1 — 5-6x lower than competitors averaging ~5.5s)
  • Blended Price: $0.84 / 1M tokens
  • Input Price: $0.26 / 1M tokens (lowest in the benchmark)
  • Context Window: 262k tokens
  • API Features: Function Calling supported; JSON mode not currently available
  • E2E Response (500 tokens): 17.22s / 13.04s

DeepInfra’s FP8 quantization delivers a sub-second TTFT (0.91s) that is 5-6x lower than competitors hovering around 5+ seconds — a decisive advantage for interactive applications where user experience depends on perceived responsiveness. The platform uses Multi-Token Prediction and Eagle speculative decoding to accelerate generation throughput, providing an OpenAI-compatible API for straightforward migration.
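As a sanity check on the end-to-end figures, streaming response time can be approximated as TTFT plus total generated tokens divided by output speed. For a reasoning model, "total generated" includes hidden thinking tokens, which is why a 500-token visible output takes far longer than 500 / 153.3 seconds. The back-of-envelope below is illustrative only; the benchmark does not publish the actual reasoning-token count:

```python
def e2e_seconds(ttft_s: float, generated_tokens: int, tokens_per_s: float) -> float:
    """Rough end-to-end latency: time to first token plus generation time."""
    return ttft_s + generated_tokens / tokens_per_s

# 500 visible tokens alone would finish in about 4.2s on DeepInfra:
visible_only = e2e_seconds(0.91, 500, 153.3)

# Working backwards from the benchmarked 17.22s instead implies roughly
# (17.22 - 0.91) * 153.3, or about 2,500 tokens generated in total,
# i.e. on the order of 2,000 hidden reasoning tokens for this workload.
implied_total_tokens = (17.22 - 0.91) * 153.3
```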

The one trade-off is the absence of JSON mode. Developers requiring deterministic structured outputs should use Alibaba Cloud or Novita, or rely on prompt engineering to enforce JSON structure when using DeepInfra.
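A minimal sketch of that prompt-engineering fallback: instruct the model to answer only with JSON, then defensively extract and parse the object from the reply, since without JSON mode the model may wrap it in prose or a code fence. The reply text and schema below are illustrative, and production code would retry the request when parsing fails:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply that may wrap it
    in prose or a ```json fence, then parse it."""
    # Prefer the contents of a fenced code block if one is present.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", reply, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the outermost brace pair in the raw text.
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in reply")
        candidate = reply[start:end + 1]
    return json.loads(candidate)

reply = 'Sure! Here is the result:\n```json\n{"sentiment": "positive", "score": 0.93}\n```'
data = extract_json(reply)  # {'sentiment': 'positive', 'score': 0.93}
```

Pairing this with a strict system prompt ("Respond with a single JSON object and nothing else") recovers most of the benefit of JSON mode, at the cost of an occasional retry.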

First-Party Provider: Alibaba Cloud

As the creator of the Qwen model family, Alibaba Cloud offers native hosting with full feature support at the lowest blended price in the benchmark.

  • Output Speed: 86.3 t/s (#2 overall)
  • Latency (TTFT): 5.57s
  • Blended Price: $0.82 / 1M tokens (lowest, tied)
  • Input Price: $0.30 / 1M tokens
  • Output Price: $2.40 / 1M tokens
  • API Features: Function Calling + JSON Mode

Alibaba Cloud delivers the second-fastest throughput (86.3 t/s) among the four providers and full JSON mode support, making it the best cost-optimized option for structured output workloads. Its TTFT of 5.57s reflects the trade-off between cost optimisation and raw speed — acceptable for batch processing but not for real-time interactive applications.

Budget Alternative: Novita

Novita offers the lowest blended price (tied) with full feature support, making it a solid option for cost-sensitive deployments that can tolerate moderate latency.

  • Output Speed: 66.7 t/s
  • Latency (TTFT): 5.08s
  • Blended Price: $0.82 / 1M tokens (lowest, tied)
  • Input Price: $0.30 / 1M tokens
  • Output Price: $2.40 / 1M tokens
  • API Features: Function Calling + JSON Mode

Novita matches Alibaba Cloud on price and features, with slightly better TTFT (5.08s vs 5.57s) but lower throughput (66.7 t/s vs 86.3 t/s). It is a viable choice for teams seeking the lowest blended cost with full feature support who are running non-interactive, batch-oriented workloads.

Budget Option: GMI (FP8)

GMI offers FP8 quantization at the market floor price, but its performance metrics trail the other three providers significantly.

  • Output Speed: 57.9 t/s (#4 — slowest in the benchmark)
  • Latency (TTFT): 5.51s
  • Blended Price: $0.82 / 1M tokens (lowest, tied)
  • Output Price: $2.40 / 1M tokens
  • API Features: Function Calling + JSON Mode

GMI delivers the lowest throughput in the benchmark (57.9 t/s) and high latency (5.51s TTFT) at the same $0.82/1M floor price as Alibaba Cloud and Novita. It is difficult to recommend over those alternatives at the same price point, unless GMI offers specific regional availability or redundancy benefits for a particular deployment.

Conclusion

For most production deployments of Qwen3.5 27B (Reasoning), DeepInfra (FP8) is the recommended provider. Its combination of industry-leading speed (153.3 t/s), sub-second latency (0.91s TTFT), and lowest input token pricing ($0.26/1M) delivers the strongest overall value proposition — at a blended cost only 2.4% above the market floor.

  • Choose DeepInfra (FP8) for the best overall performance — fastest speed, lowest latency, and competitive pricing.
  • Choose Alibaba Cloud for cost-optimised batch workloads requiring JSON mode, with the best throughput among the budget options (86.3 t/s).
  • Choose Novita for the lowest blended cost with full feature support at slightly lower throughput than Alibaba Cloud.
  • Choose GMI (FP8) only if geographic or ecosystem-specific requirements make the other three providers unavailable.
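The guidance above boils down to two questions: does the workload need JSON mode, and is it latency-sensitive? A toy decision rule (my own framing of the recommendations, not an official selector) might look like:

```python
def pick_provider(needs_json: bool, latency_sensitive: bool) -> str:
    """Encode the selection guidance above as a simple decision rule."""
    if not needs_json:
        return "DeepInfra (FP8)"   # fastest speed, lowest TTFT, near-floor price
    if latency_sensitive:
        # No sub-second-TTFT provider supports JSON mode; Novita has the
        # best TTFT (5.08s) among those that do.
        return "Novita"
    return "Alibaba Cloud"         # best throughput among JSON-capable options

pick_provider(needs_json=False, latency_sensitive=True)   # 'DeepInfra (FP8)'
pick_provider(needs_json=True, latency_sensitive=False)   # 'Alibaba Cloud'
```

GMI is deliberately absent: per the benchmark it only makes sense when geographic or ecosystem constraints rule out the other three.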