Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a dense multimodal model combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference compared to traditional dense architectures. The architecture utilizes a 3:1 ratio of linear attention to full attention, maintaining a 262,144-token context window while remaining efficient enough to run on standard hardware.
Unlike previous generations that added vision capabilities post-hoc, Qwen3.5 9B was trained using early fusion on multimodal tokens, allowing the model to process visual and textual tokens within the same latent space from the start of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses. The model’s performance is largely attributed to Scaled Reinforcement Learning, which optimizes for correct reasoning paths rather than mimicking high-quality text — producing improved instruction following, fewer hallucinations, and higher reliability in fact-retrieval and mathematical reasoning.
Qwen3.5 9B is released under the Apache 2.0 license, enabling commercial use and fine-tuning. It is now offered by multiple API providers. This analysis compares the two tracked providers on performance, cost, and response time so you can pick the best fit for your use case.

| Provider | Quant. | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | E2E (s) | Context | Why Notable |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | FP8 | $0.08 | $0.04 | $0.20 | 205.7 | 1.04 | 13.19 / 9.72 | 262k | Best throughput + blended cost; best for long inputs and fastest generation |
| Together.ai | FP8 | $0.11 | $0.10 | $0.15 | 92.3 | 0.75 | 27.84 / 21.67 | 262k | Best TTFT latency; slower throughput and higher blended cost |

Based on benchmarks across the two tracked providers, DeepInfra is the recommended API for production-scale Qwen3.5 9B deployment. It delivers roughly 2.2x the output speed (205.7 vs. 92.3 t/s), the lowest blended price ($0.08/1M), and resolves tasks in less than half the end-to-end time of Together.ai (13.19s vs. 27.84s). Together.ai remains a viable alternative for highly interactive, conversational applications where sub-second TTFT (0.75s) is the primary requirement.
Output speed measures how quickly tokens are generated after the model begins its response — the primary metric for throughput-intensive tasks.
DeepInfra operates at approximately 2.2x the speed of Together.ai. For applications generating long-form content, analyzing large datasets, or requiring rapid data extraction, this throughput advantage translates directly into reduced wait times. The gap is large enough to be decisive for any workload where generation volume is the primary bottleneck.
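To make the throughput gap concrete, the sketch below converts the two measured output speeds into streaming time for a response of a given length. The 1,000-token response length is an illustrative assumption, not a benchmark figure, and the function name is hypothetical.

```python
# Output speeds in tokens per second, taken from the comparison table.
OUTPUT_SPEED_TPS = {"DeepInfra": 205.7, "Together.ai": 92.3}


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent streaming tokens, excluding TTFT and any reasoning phase."""
    return output_tokens / tokens_per_second


def speed_ratio(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the first provider streams than the second."""
    return fast_tps / slow_tps


tokens = 1_000  # assumed response length, for illustration only
for provider, tps in OUTPUT_SPEED_TPS.items():
    print(f"{provider}: {generation_seconds(tokens, tps):.2f}s for {tokens} tokens")

ratio = speed_ratio(OUTPUT_SPEED_TPS["DeepInfra"], OUTPUT_SPEED_TPS["Together.ai"])
print(f"Speed ratio: {ratio:.2f}x")
```

Because streaming time scales linearly with output length, the absolute gap in seconds grows with every additional token generated.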
TTFT measures the initial responsiveness of an application. For reasoning models like Qwen3.5 9B, this includes the model’s internal thinking time before outputting the first user-facing answer token.
Together.ai wins the latency category with a sub-second TTFT of 0.75s. For highly interactive applications, such as real-time chatbots or voice-to-text assistants, this edge creates a snappier perceived experience. DeepInfra's 1.04s is still fast, and the 290ms difference will be imperceptible to most users in practice, but it is measurable and relevant for latency-critical applications.
Pricing is evaluated per 1 million tokens, with the blended rate assuming a standard 3:1 input-to-output ratio.
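The blended rate is a weighted average of the input and output prices. A minimal sketch of that calculation follows, using the 3:1 ratio stated above; the function name and its parameters are illustrative rather than part of any provider's API.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price per 1M tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# (input $/1M, output $/1M) from the comparison table.
PRICES = {"DeepInfra": (0.04, 0.20), "Together.ai": (0.10, 0.15)}

for provider, (input_price, output_price) in PRICES.items():
    print(f"{provider}: ${blended_price(input_price, output_price):.4f} per 1M tokens")
```

With the table's prices this gives $0.08 for DeepInfra and $0.1125 for Together.ai, which the table rounds to $0.11.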
Because most reasoning and RAG workloads are heavily weighted toward input tokens (large system prompts, document context, retrieval results), DeepInfra’s aggressively priced input tier ($0.04/1M) makes it the more cost-effective choice for the vast majority of real-world usage patterns. Together.ai’s cheaper output pricing ($0.15 vs $0.20) only becomes advantageous once output tokens exceed about 1.2x the input tokens, the point at which its $0.05/1M output saving outweighs DeepInfra’s $0.06/1M input saving. Output-heavy workloads of that kind are a less common pattern for reasoning models.
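The trade-off between the two price structures can be worked through numerically. The sketch below computes per-request cost and the output-to-input token ratio at which both providers cost the same. The function names and the two sample workloads are hypothetical, and the sketch does not model how each provider bills reasoning tokens, which the source does not specify.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request; prices are per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def break_even_output_ratio(a: tuple, b: tuple) -> float:
    """Output/input token ratio at which providers a and b cost the same.

    Each argument is (input_price, output_price). Solving
    a_in*i + a_out*o == b_in*i + b_out*o for o/i gives the expression below.
    """
    (a_in, a_out), (b_in, b_out) = a, b
    return (b_in - a_in) / (a_out - b_out)


DEEPINFRA = (0.04, 0.20)   # (input $/1M, output $/1M) from the table
TOGETHER = (0.10, 0.15)

print(f"Break-even output/input ratio: {break_even_output_ratio(DEEPINFRA, TOGETHER):.2f}")

# Two assumed workloads: input-heavy (RAG-style) and output-heavy.
for label, (tokens_in, tokens_out) in {"input-heavy": (10_000, 500),
                                       "output-heavy": (1_000, 2_000)}.items():
    deepinfra_cost = request_cost(tokens_in, tokens_out, *DEEPINFRA)
    together_cost = request_cost(tokens_in, tokens_out, *TOGETHER)
    print(f"{label}: DeepInfra ${deepinfra_cost:.6f}, Together.ai ${together_cost:.6f}")
```

Below the break-even ratio DeepInfra is cheaper per request, and above it Together.ai is.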
End-to-end response time combines initial latency, reasoning time, and output speed to measure the complete lifecycle of a request — specifically, how long it takes to deliver a 500-token response from a 10,000 input token prompt.
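This definition can be checked against the table. The sketch below assumes end-to-end time decomposes as TTFT plus reasoning time plus the streaming time for the 500 answer tokens. That decomposition and the function names are assumptions, not the benchmark's published methodology. Under that assumption, the residual for each provider lands within 0.01s of the second figure in the table's E2E column, which suggests (but does not confirm) that the second figure is the reasoning time.

```python
ANSWER_TOKENS = 500  # response length from the end-to-end definition

# (TTFT seconds, output speed t/s, reported E2E seconds) from the table.
MEASURED = {
    "DeepInfra": (1.04, 205.7, 13.19),
    "Together.ai": (0.75, 92.3, 27.84),
}


def streaming_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream a given number of tokens at a steady output speed."""
    return tokens / tokens_per_second


def implied_reasoning_seconds(ttft: float, tokens_per_second: float, e2e: float,
                              answer_tokens: int = ANSWER_TOKENS) -> float:
    """Residual of E2E after removing TTFT and answer streaming time.

    Assumes E2E = TTFT + reasoning + answer_tokens / speed.
    """
    return e2e - ttft - streaming_seconds(answer_tokens, tokens_per_second)


for provider, (ttft, tps, e2e) in MEASURED.items():
    residual = implied_reasoning_seconds(ttft, tps, e2e)
    print(f"{provider}: implied reasoning time {residual:.2f}s of {e2e:.2f}s total")
```

On this reading, reasoning dominates end-to-end time for both providers, so output speed matters more to total latency than the TTFT difference does.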
DeepInfra resolves tasks in less than half the time of Together.ai. Despite Together.ai’s slight TTFT advantage, DeepInfra’s 2.2x throughput lead entirely eclipses that edge when measuring total task completion time. For any workload beyond a single short exchange, DeepInfra delivers a substantially faster experience end-to-end.
Both providers support the full 262,144-token (262k) context window natively available to Qwen3.5 9B, and both fully support Function (Tool) Calling. This means provider selection can rest entirely on performance and pricing metrics — neither provider imposes a technical ceiling on what you can build.
For the vast majority of Qwen3.5 9B deployments, DeepInfra is the recommended provider. With 205.7 t/s output speed, an end-to-end response time of 13.19s, and the lowest blended price among the tracked providers at $0.08 per million tokens, DeepInfra delivers the strongest combination of speed and cost-effectiveness in this comparison.
© 2026 DeepInfra. All rights reserved.