Qwen3.5 9B API Benchmarks: Latency, Throughput & Cost
Published on April 3, 2026, by DeepInfra

About Qwen3.5 9B

Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a multimodal model built on an efficient hybrid architecture, combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference than traditional dense architectures. The architecture uses a 3:1 ratio of linear attention to full attention, maintaining a 262,144-token context window while remaining efficient enough to run on standard hardware.
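As a rough illustration of that 3:1 interleaving, consider the layer schedule below. This is a hypothetical sketch of the ratio only; the real layer ordering comes from the released model configuration:

```python
# Hypothetical sketch of a 3:1 linear-to-full attention layer schedule.
# The real Qwen3.5 layer ordering is defined by the released model config;
# this snippet only illustrates the ratio described above.

def attention_kind(layer_idx: int) -> str:
    # In each group of four layers, three use linear attention
    # (Gated Delta Networks) and one uses full softmax attention.
    return "full" if layer_idx % 4 == 3 else "linear"

print([attention_kind(i) for i in range(8)])
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```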

Unlike previous generations that added vision capabilities post-hoc, Qwen3.5 9B was trained with early fusion on multimodal tokens, allowing the model to process visual and textual tokens in the same latent space from the start of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visually grounded responses. The model’s performance is largely attributed to Scaled Reinforcement Learning, which optimizes for correct reasoning paths rather than mimicking high-quality text, producing improved instruction following, fewer hallucinations, and higher reliability in fact retrieval and mathematical reasoning.

Qwen3.5 9B is released under the Apache 2.0 license, enabling commercial use and fine-tuning. It is now offered by multiple API providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 9B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the fastest provider: 205.7 t/s vs Together.ai at 92.3 t/s — approximately 2.2x higher throughput.
  • DeepInfra (FP8) is the lowest-cost option: $0.08 blended / 1M tokens vs $0.11, a ~1.4x price spread.
  • DeepInfra has the cheapest input pricing: $0.04 / 1M input tokens vs Together.ai’s $0.10 — especially beneficial for long-context (10k input token) workloads.
  • DeepInfra has the fastest end-to-end response time: 13.19s vs Together.ai’s 27.84s for a 500-token output.
  • Together.ai wins on TTFT: 0.75s vs DeepInfra’s 1.04s — the only metric where Together.ai leads.
  • Both providers support function/tool calling and the full 262k context window.

Qwen3.5 9B (Reasoning) — Best APIs

| Provider | Quant. | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | E2E (s) | Context | Why Notable |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | FP8 | $0.08 | $0.04 | $0.20 | 205.7 | 1.04 | 13.19 / 9.72 | 262k | Best throughput + blended cost; best for long inputs and fastest generation |
| Together.ai (FP8) | FP8 | $0.11 | $0.10 | $0.15 | 92.3 | 0.75 | 27.84 / 21.67 | 262k | Best TTFT latency; slower throughput and higher blended cost |

Quick Verdict: Which Qwen3.5 9B Provider is Best?

Based on benchmarks across the two tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 9B deployment. It delivers 2.2x faster output speed, the lowest blended price ($0.08/1M), and resolves tasks in less than half the end-to-end time of Together.ai. Together.ai remains a viable alternative for highly interactive, conversational applications where sub-second TTFT (0.75s) is the primary requirement.

Output Speed: DeepInfra Leads by 2.2x

Output speed measures how quickly tokens are generated after the model begins its response — the primary metric for throughput-intensive tasks.

  • DeepInfra: 205.7 t/s
  • Together.ai: 92.3 t/s

DeepInfra operates at approximately 2.2x the speed of Together.ai. For applications generating long-form content, analyzing large datasets, or requiring rapid data extraction, this throughput advantage translates directly into reduced wait times. The gap is large enough to be decisive for any workload where generation volume is the primary bottleneck.

Latency: Together.ai Has the Edge

TTFT measures the initial responsiveness of an application. For reasoning models like Qwen3.5 9B, this includes the model’s internal thinking time before outputting the first user-facing answer token.

  • Together.ai: 0.75s (sub-second)
  • DeepInfra: 1.04s

Together.ai wins the latency category with a sub-second TTFT of 0.75s. For highly interactive applications — real-time chatbots or voice-to-text assistants — this edge creates a snappier perceived experience. DeepInfra at 1.04s is still highly performant and will be imperceptible to most users in practice, but the 290ms gap is measurable and relevant for latency-critical applications.
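Both the throughput and TTFT figures above can be reproduced with a single streaming request. Here is a minimal sketch using the OpenAI-compatible Python client; the base URL and model ID are assumptions to substitute with your provider’s actual values:

```python
# Minimal TTFT / output-speed probe over a streaming chat completion.
# Base URL and model ID are placeholder assumptions, not benchmark values.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",  # hypothetical model ID
    messages=[{"role": "user", "content": "Explain TCP slow start briefly."}],
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            # First visible answer token; for reasoning models this arrives
            # after any internal thinking, matching the TTFT definition above.
            first_token_at = time.perf_counter()
        n_chunks += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.2f}s")
# Chunk count approximates token count on most servers.
print(f"Output speed: ~{n_chunks / (end - first_token_at):.1f} t/s")
```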

Cost Efficiency: DeepInfra Is Cheaper Across the Board

Pricing is evaluated per 1 million tokens, with the blended rate assuming a standard 3:1 input-to-output ratio.

  • Blended Price: DeepInfra $0.08 vs Together.ai $0.11 — DeepInfra is 27% cheaper overall.
  • Input Price: DeepInfra $0.04 vs Together.ai $0.10 — DeepInfra is 60% cheaper on input tokens.
  • Output Price: Together.ai $0.15 vs DeepInfra $0.20 — Together.ai is cheaper on output tokens only.

Because most reasoning and RAG workloads are heavily weighted toward input tokens (large system prompts, document context, retrieval results), DeepInfra’s aggressively priced input tier ($0.04/1M) makes it the more cost-effective choice for the vast majority of real-world usage patterns. Together.ai’s cheaper output pricing ($0.15 vs $0.20) only becomes advantageous for workloads with very short inputs and very long outputs — a less common pattern for reasoning models.
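As a quick sanity check, both blended figures follow directly from the 3:1 weighting:

```python
# Blended $/1M tokens at the 3:1 input-to-output ratio stated above.
def blended(input_price: float, output_price: float) -> float:
    return (3 * input_price + output_price) / 4

print(blended(0.04, 0.20))  # DeepInfra:   0.08
print(blended(0.10, 0.15))  # Together.ai: 0.1125, listed as ~$0.11
```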

End-to-End Response Time: DeepInfra Is More Than 2x Faster

End-to-end response time combines initial latency, reasoning time, and output speed to measure the complete lifecycle of a request — specifically, how long it takes to deliver a 500-token response from a 10,000-token input prompt.

  • DeepInfra: 13.19s
  • Together.ai: 27.84s

DeepInfra resolves tasks in less than half the time of Together.ai. Despite Together.ai’s slight TTFT advantage, DeepInfra’s 2.2x throughput lead entirely eclipses that edge when measuring total task completion time. For any workload beyond a single short exchange, DeepInfra delivers a substantially faster experience end-to-end.
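A back-of-envelope decomposition makes the two numbers easy to reconcile. Assuming E2E ≈ TTFT + generated tokens / output speed, both measurements are consistent with roughly 2,500 generated tokens per request (the 500 visible output tokens plus about 2,000 internal reasoning tokens). The 2,500 figure is inferred here, not published by the benchmark:

```python
# Consistency check: E2E ~= TTFT + generated_tokens / speed.
# The ~2,500-token total (500 visible + ~2,000 reasoning) is an inference
# from the published numbers, not a figure reported by the benchmark.
for name, ttft, tps in [("DeepInfra", 1.04, 205.7), ("Together.ai", 0.75, 92.3)]:
    print(f"{name}: {ttft + 2500 / tps:.2f}s")
# DeepInfra:   13.19s (reported: 13.19s)
# Together.ai: 27.84s (reported: 27.84s)
```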

Context Window and API Features

Both providers support the full 262,144-token (262k) context window natively available to Qwen3.5 9B, and both fully support Function (Tool) Calling. This means provider selection can rest entirely on performance and pricing metrics — neither provider imposes a technical ceiling on what you can build.
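For completeness, here is a minimal tool-calling sketch against an OpenAI-compatible endpoint; the tool schema, base URL, and model ID are illustrative assumptions rather than values from this benchmark:

```python
# Minimal function (tool) calling sketch over an OpenAI-compatible API.
# The weather tool, base URL, and model ID are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",  # hypothetical model ID
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # structured tool call, if any
```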

Conclusion

For the vast majority of Qwen3.5 9B deployments, DeepInfra is the recommended provider. With 205.7 t/s output speed, an end-to-end response time of just 13.19s, and the lowest blended price of the tracked providers at $0.08 per million tokens, DeepInfra delivers an unmatched combination of speed and cost-effectiveness.

  • Choose DeepInfra for the best overall value — fastest throughput, lowest cost, and best end-to-end response times.
  • Choose Together.ai strictly for highly interactive applications where sub-second TTFT (0.75s) is the primary architectural requirement.