Qwen3.5 9B API Benchmarks: Latency, Throughput & Cost
Published on April 3, 2026, by DeepInfra

About Qwen3.5 9B

Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a multimodal model built on an efficient hybrid architecture, combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference than traditional dense architectures. The architecture uses a 3:1 ratio of linear attention to full attention, maintaining a 262,144-token context window while remaining efficient enough to run on standard hardware.
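As a rough illustration of that 3:1 interleaving, consider the layer schedule below. This is a hypothetical sketch of the ratio only; the real layer ordering comes from the released model configuration:

```python
# Hypothetical sketch of a 3:1 linear-to-full attention layer schedule.
# The real Qwen3.5 layer ordering is defined by the released model config;
# this snippet only illustrates the ratio described above.

def attention_kind(layer_idx: int) -> str:
    # In each group of four layers, three use linear attention
    # (Gated Delta Networks) and one uses full softmax attention.
    return "full" if layer_idx % 4 == 3 else "linear"

print([attention_kind(i) for i in range(8)])
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```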

Unlike previous generations that added vision capabilities post-hoc, Qwen3.5 9B was trained with early fusion on multimodal tokens, allowing the model to process visual and textual tokens in the same latent space from the start of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visually grounded responses. The model’s performance is largely attributed to Scaled Reinforcement Learning, which optimizes for correct reasoning paths rather than mimicking high-quality text, producing improved instruction following, fewer hallucinations, and higher reliability in fact retrieval and mathematical reasoning.

Qwen3.5 9B is released under the Apache 2.0 license, enabling commercial use and fine-tuning. It is now offered by multiple API providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3.5 9B (Reasoning) API Review Summary

  • DeepInfra (FP8) is the fastest provider: 205.7 t/s vs Together.ai at 92.3 t/s — approximately 2.2x higher throughput.
  • DeepInfra (FP8) is the lowest-cost option: $0.08 blended / 1M tokens vs $0.11, a ~1.4x price spread.
  • DeepInfra has the cheapest input pricing: $0.04 / 1M input tokens vs Together.ai’s $0.10 — especially beneficial for long-context (10k input token) workloads.
  • DeepInfra has the fastest end-to-end response time: 13.19s vs Together.ai’s 27.84s for a 500-token output.
  • Together.ai wins on TTFT: 0.75s vs DeepInfra’s 1.04s — the only metric where Together.ai leads.
  • Both providers support function/tool calling and the full 262k context window.

Qwen3.5 9B (Reasoning) — Best APIs

| Provider | Quant. | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | E2E (s) | Context | Why Notable |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | FP8 | $0.08 | $0.04 | $0.20 | 205.7 | 1.04 | 13.19 / 9.72 | 262k | Best throughput + blended cost; best for long inputs and fastest generation |
| Together.ai (FP8) | FP8 | $0.11 | $0.10 | $0.15 | 92.3 | 0.75 | 27.84 / 21.67 | 262k | Best TTFT latency; slower throughput and higher blended cost |

Quick Verdict: Which Qwen3.5 9B Provider is Best?

Based on benchmarks across the two tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 9B deployment. It delivers 2.2x faster output speed, the lowest blended price ($0.08/1M), and resolves tasks in less than half the end-to-end time of Together.ai. Together.ai remains a viable alternative for highly interactive, conversational applications where sub-second TTFT (0.75s) is the primary requirement.

Output Speed: DeepInfra Leads by 2.2x

Output speed measures how quickly tokens are generated after the model begins its response — the primary metric for throughput-intensive tasks.

  • DeepInfra: 205.7 t/s
  • Together.ai: 92.3 t/s

DeepInfra operates at approximately 2.2x the speed of Together.ai. For applications generating long-form content, analyzing large datasets, or requiring rapid data extraction, this throughput advantage translates directly into reduced wait times. The gap is large enough to be decisive for any workload where generation volume is the primary bottleneck.

Latency: Together.ai Has the Edge

TTFT measures the initial responsiveness of an application. For reasoning models like Qwen3.5 9B, this includes the model’s internal thinking time before outputting the first user-facing answer token.

  • Together.ai: 0.75s (sub-second)
  • DeepInfra: 1.04s

Together.ai wins the latency category with a sub-second TTFT of 0.75s. For highly interactive applications — real-time chatbots or voice-to-text assistants — this edge creates a snappier perceived experience. DeepInfra at 1.04s is still highly performant and will be imperceptible to most users in practice, but the 290ms gap is measurable and relevant for latency-critical applications.
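Both the throughput and TTFT figures above can be reproduced with a single streaming request. Here is a minimal sketch using the OpenAI-compatible Python client; the base URL and model ID are assumptions to substitute with your provider’s actual values:

```python
# Minimal TTFT / output-speed probe over a streaming chat completion.
# Base URL and model ID are placeholder assumptions, not benchmark values.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",  # hypothetical model ID
    messages=[{"role": "user", "content": "Explain TCP slow start briefly."}],
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            # First visible answer token; for reasoning models this arrives
            # after any internal thinking, matching the TTFT definition above.
            first_token_at = time.perf_counter()
        n_chunks += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.2f}s")
# Chunk count approximates token count on most servers.
print(f"Output speed: ~{n_chunks / (end - first_token_at):.1f} t/s")
```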

Cost Efficiency: DeepInfra Is Cheaper Across the Board

Pricing is evaluated per 1 million tokens, with the blended rate assuming a standard 3:1 input-to-output ratio.

  • Blended Price: DeepInfra $0.08 vs Together.ai $0.11 — DeepInfra is 27% cheaper overall.
  • Input Price: DeepInfra $0.04 vs Together.ai $0.10 — DeepInfra is 60% cheaper on input tokens.
  • Output Price: Together.ai $0.15 vs DeepInfra $0.20 — Together.ai is cheaper on output tokens only.

Because most reasoning and RAG workloads are heavily weighted toward input tokens (large system prompts, document context, retrieval results), DeepInfra’s aggressively priced input tier ($0.04/1M) makes it the more cost-effective choice for the vast majority of real-world usage patterns. Together.ai’s cheaper output pricing ($0.15 vs $0.20) only becomes advantageous for workloads with very short inputs and very long outputs — a less common pattern for reasoning models.
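As a quick sanity check, both blended figures follow directly from the 3:1 weighting:

```python
# Blended $/1M tokens at the 3:1 input-to-output ratio stated above.
def blended(input_price: float, output_price: float) -> float:
    return (3 * input_price + output_price) / 4

print(blended(0.04, 0.20))  # DeepInfra:   0.08
print(blended(0.10, 0.15))  # Together.ai: 0.1125, listed as ~$0.11
```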

End-to-End Response Time: DeepInfra Is More Than 2x Faster

End-to-end response time combines initial latency, reasoning time, and output speed to measure the complete lifecycle of a request — specifically, how long it takes to deliver a 500-token response from a 10,000-token input prompt.

  • DeepInfra: 13.19s
  • Together.ai: 27.84s

DeepInfra resolves tasks in less than half the time of Together.ai. Despite Together.ai’s slight TTFT advantage, DeepInfra’s 2.2x throughput lead entirely eclipses that edge when measuring total task completion time. For any workload beyond a single short exchange, DeepInfra delivers a substantially faster experience end-to-end.
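A back-of-envelope decomposition makes the two numbers easy to reconcile. Assuming E2E ≈ TTFT + generated tokens / output speed, both measurements are consistent with roughly 2,500 generated tokens per request (the 500 visible output tokens plus about 2,000 internal reasoning tokens). The 2,500 figure is inferred here, not published by the benchmark:

```python
# Consistency check: E2E ~= TTFT + generated_tokens / speed.
# The ~2,500-token total (500 visible + ~2,000 reasoning) is an inference
# from the published numbers, not a figure reported by the benchmark.
for name, ttft, tps in [("DeepInfra", 1.04, 205.7), ("Together.ai", 0.75, 92.3)]:
    print(f"{name}: {ttft + 2500 / tps:.2f}s")
# DeepInfra:   13.19s (reported: 13.19s)
# Together.ai: 27.84s (reported: 27.84s)
```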

Context Window and API Features

Both providers support the full 262,144-token (262k) context window natively available to Qwen3.5 9B, and both fully support Function (Tool) Calling. This means provider selection can rest entirely on performance and pricing metrics — neither provider imposes a technical ceiling on what you can build.
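For completeness, here is a minimal tool-calling sketch against an OpenAI-compatible endpoint; the tool schema, base URL, and model ID are illustrative assumptions rather than values from this benchmark:

```python
# Minimal function (tool) calling sketch over an OpenAI-compatible API.
# The weather tool, base URL, and model ID are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-9B",  # hypothetical model ID
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # structured tool call, if any
```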

Conclusion

For the vast majority of Qwen3.5 9B deployments, DeepInfra is the recommended provider. With 205.7 t/s output speed, an end-to-end response time of just 13.19s, and the lowest blended price of the tracked providers at $0.08 per million tokens, DeepInfra delivers an unmatched combination of speed and cost-effectiveness.

  • Choose DeepInfra for the best overall value — fastest throughput, lowest cost, and best end-to-end response times.
  • Choose Together.ai strictly for highly interactive applications where sub-second TTFT (0.75s) is the primary architectural requirement.