
Qwen3.5 0.8B is part of Alibaba Cloud’s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of “More Intelligence, Less Compute,” it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a 3:1 ratio of linear to full attention layers) with sparse Mixture-of-Experts, enabling high output quality while controlling memory growth — supporting a 262,000-token context window despite its compact footprint.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 0.8B features native multimodal capabilities through early fusion training on multimodal tokens. The model supports 201 languages and dialects, uses extended chain-of-thought reasoning to work through complex problems before providing an answer, and supports function calling for agentic workflows. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats, and is released under the Apache 2.0 license enabling commercial use and fine-tuning.
Qwen3.5 0.8B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is the only provider for Qwen3.5 0.8B deployment. It delivers 403.5 t/s output speed, a 0.37s TTFT, and a blended price of $0.02/1M tokens. The combination of sub-half-second latency, high throughput, and native JSON mode and function calling support makes it well suited for both real-time and batch workloads.
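As a minimal sketch, a request to DeepInfra's OpenAI-compatible chat endpoint can be assembled as below. The model identifier `Qwen/Qwen3.5-0.8B` is an assumption for illustration; check the DeepInfra model page for the exact string.

```python
# Build headers and body for a single-turn chat completion against
# DeepInfra's OpenAI-compatible endpoint. The model ID is assumed.
import json

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"

def build_chat_request(prompt: str, api_key: str) -> tuple[dict, bytes]:
    """Return (headers, body) for a streaming chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream tokens as they are generated
    }).encode()
    return headers, body

# Send with urllib, requests, or the openai client, e.g.:
#   urllib.request.urlopen(urllib.request.Request(API_URL, body, headers))
```

Because the endpoint is OpenAI-compatible, existing SDKs and agent frameworks can typically be pointed at it by changing only the base URL and API key.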
For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.37 seconds, measured on a 10,000-token input workload, which for a reasoning model includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates perceptible startup delay in real-time applications, making it a strong inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 403.5 tokens per second — a sustained P50 measurement over a 72-hour period.
At 403.5 t/s, a standard 500-token response is generated in approximately 1.2 seconds. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
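The streaming-time arithmetic above can be sketched directly from the quoted throughput and TTFT figures; no assumptions beyond the article's own numbers:

```python
# Rough generation-time estimate from sustained throughput.
# Figures are the DeepInfra numbers quoted in this article.
def gen_time_s(tokens: int, tokens_per_s: float) -> float:
    """Seconds to stream `tokens` at a sustained rate of `tokens_per_s`."""
    return tokens / tokens_per_s

# 500-token response at 403.5 t/s, plus the 0.37 s TTFT:
stream_s = gen_time_s(500, 403.5)
print(f"streaming: {stream_s:.2f} s, perceived total: {0.37 + stream_s:.2f} s")
```

Note this covers pure token streaming; for the reasoning variant, internal chain-of-thought time adds to the end-to-end total discussed below.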
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output in 6.56 seconds, composed of the 0.37 s TTFT, a 4.96-second internal reasoning phase, and approximately 1.23 seconds of pure output time (500 tokens at 403.5 t/s).
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra offers the following pricing for Qwen3.5 0.8B inference: $0.01 per 1M input tokens, $0.05 per 1M output tokens, and a blended rate of $0.02 per 1M tokens.
The heavily discounted input pricing ($0.01/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API prior to generation. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
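At the listed rates, per-request cost is easy to sketch. The 50,000-token RAG context below is an illustrative workload, not a benchmark figure:

```python
# Hedged per-request cost sketch at the listed DeepInfra rates.
INPUT_USD_PER_M = 0.01   # $ per 1M input tokens
OUTPUT_USD_PER_M = 0.05  # $ per 1M output tokens

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given its input and output token counts."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# A RAG call sending a 50,000-token context and getting a 500-token answer:
print(f"${request_cost_usd(50_000, 500):.6f} per request")
```

Because input tokens are priced at a fifth of output tokens, context-heavy workloads like RAG are dominated by the cheap side of the rate card.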
DeepInfra’s deployment of Qwen3.5 0.8B supports a 262k token context window alongside native Function Calling (Tool Use) and JSON Mode. A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling and JSON mode support enables the model to reliably trigger external APIs, return structured outputs, and interact with complex agentic workflows.
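A tool-calling request follows the OpenAI-style schema on this endpoint. In the sketch below, the `get_weather` tool and the model identifier are illustrative assumptions, not part of the DeepInfra API surface:

```python
# Sketch of an OpenAI-style tool-calling request body. The tool
# definition and model ID here are illustrative assumptions.
def tool_call_body(user_msg: str) -> dict:
    return {
        "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        # For structured output without tools, OpenAI-style JSON mode is:
        # "response_format": {"type": "json_object"},
    }
```

When the model decides to call the tool, the response carries a `tool_calls` entry with JSON arguments matching the declared parameter schema, which the client executes before returning the result in a follow-up message.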
DeepInfra (FP8) offers the lowest pricing at $0.01 per 1M input tokens and $0.05 per 1M output tokens, with a blended rate of $0.02 per 1M tokens.
On DeepInfra (FP8), the median TTFT is 0.37 seconds on a 10,000 input token workload, measured as P50 over 72 hours.
The model supports a 262,000-token (262k) context window, enabling extensive RAG use cases and processing of large documents or codebases.
DeepInfra's API provides native support for both function (tool) calling and JSON mode, making Qwen3.5 0.8B suitable for autonomous agent development.
DeepInfra (FP8) delivers 403.5 tokens per second, allowing a standard 500-token response to be generated in approximately 1.2 seconds.
The model is also available for local deployment under the Apache 2.0 license on Hugging Face and ModelScope. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats via llama.cpp or Ollama.
For developers deploying Qwen3.5 0.8B (Reasoning), DeepInfra (FP8) is the way to go. It combines a sub-half-second TTFT (0.37s), high output throughput (403.5 t/s), and a blended price of just $0.02 per million tokens — delivering strong performance for both latency-sensitive and throughput-intensive production workloads, with native JSON mode and function calling support included.