
Qwen3.5 0.8B is part of Alibaba Cloud’s Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of “More Intelligence, Less Compute,” it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a 3:1 ratio of linear to full attention layers) with sparse Mixture-of-Experts, enabling high output quality while controlling memory growth — supporting a 262,000-token context window despite its compact footprint.
Unlike earlier small models that added vision capabilities post-hoc, Qwen3.5 0.8B features native multimodal capabilities through early fusion training on multimodal tokens. The model supports 201 languages and dialects, uses extended chain-of-thought reasoning to work through complex problems before providing an answer, and supports function calling for agentic workflows. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats, and is released under the Apache 2.0 license enabling commercial use and fine-tuning.
Qwen3.5 0.8B is now available via DeepInfra — this analysis breaks down the key performance metrics developers need to evaluate before deploying.
DeepInfra is the only provider for Qwen3.5 0.8B deployment. It delivers 403.5 t/s output speed, a 0.37s TTFT, and a blended price of $0.02/1M tokens. The combination of sub-half-second latency, high throughput, and native JSON mode and function calling support makes it well suited for both real-time and batch workloads.
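As a minimal sketch, a request to DeepInfra's OpenAI-compatible chat endpoint can be assembled as below. The model identifier `Qwen/Qwen3.5-0.8B` is an assumption for illustration; check the DeepInfra model page for the exact string.

```python
# Build headers and body for a single-turn chat completion against
# DeepInfra's OpenAI-compatible endpoint. The model ID is assumed.
import json

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"

def build_chat_request(prompt: str, api_key: str) -> tuple[dict, bytes]:
    """Return (headers, body) for a streaming chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream tokens as they are generated
    }).encode()
    return headers, body

# Send with urllib, requests, or the openai client, e.g.:
#   urllib.request.urlopen(urllib.request.Request(API_URL, body, headers))
```

Because the endpoint is OpenAI-compatible, existing SDKs and agent frameworks can typically be pointed at it by changing only the base URL and API key.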
For interactive AI applications, chatbots, and real-time agentic workflows, TTFT is the most critical user-facing metric. DeepInfra records a median TTFT of 0.37 seconds, measured on a 10,000-token input workload, which for a reasoning model includes initial input processing and generation of the first reasoning token.
A sub-half-second TTFT effectively eliminates perceptible startup delay in real-time applications, making it a strong inference choice for applications requiring immediate perceived responsiveness, from conversational interfaces to coding assistants.
Inference output speed dictates how quickly a model can stream its generated response after the first token is received. DeepInfra achieves 403.5 tokens per second — a sustained P50 measurement over a 72-hour period.
At 403.5 t/s, a standard 500-token response is generated in approximately 1.2 seconds. For throughput-intensive tasks such as bulk summarization, automated report generation, long-form content creation, or complex programmatic reasoning, this generation speed ensures token output never becomes a pipeline bottleneck.
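The streaming-time arithmetic above can be sketched directly from the quoted throughput and TTFT figures; no assumptions beyond the article's own numbers:

```python
# Rough generation-time estimate from sustained throughput.
# Figures are the DeepInfra numbers quoted in this article.
def gen_time_s(tokens: int, tokens_per_s: float) -> float:
    """Seconds to stream `tokens` at a sustained rate of `tokens_per_s`."""
    return tokens / tokens_per_s

# 500-token response at 403.5 t/s, plus the 0.37 s TTFT:
stream_s = gen_time_s(500, 403.5)
print(f"streaming: {stream_s:.2f} s, perceived total: {0.37 + stream_s:.2f} s")
```

Note this covers pure token streaming; for the reasoning variant, internal chain-of-thought time adds to the end-to-end total discussed below.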
End-to-end response time provides the most accurate view of total API transaction duration. DeepInfra completes a full 500-token output in 6.56 seconds, composed of the 0.37 s TTFT, a 4.96-second internal reasoning phase, and approximately 1.23 seconds of pure output time (500 tokens at 403.5 t/s).
This predictable and stable E2E latency prevents client-side request timeouts during multi-step prompt executions and makes it well suited for complex, multi-step agentic workflows.
DeepInfra offers the following pricing for Qwen3.5 0.8B inference: $0.01 per 1M input tokens, $0.05 per 1M output tokens, and a blended rate of $0.02 per 1M tokens.
The heavily discounted input pricing ($0.01/1M) makes it particularly cost-effective for RAG architectures, where large context payloads are sent to the API prior to generation. For high-volume deployments processing millions of tokens per day, this pricing structure delivers strong operational economics.
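At the listed rates, per-request cost is easy to sketch. The 50,000-token RAG context below is an illustrative workload, not a benchmark figure:

```python
# Hedged per-request cost sketch at the listed DeepInfra rates.
INPUT_USD_PER_M = 0.01   # $ per 1M input tokens
OUTPUT_USD_PER_M = 0.05  # $ per 1M output tokens

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given its input and output token counts."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# A RAG call sending a 50,000-token context and getting a 500-token answer:
print(f"${request_cost_usd(50_000, 500):.6f} per request")
```

Because input tokens are priced at a fifth of output tokens, context-heavy workloads like RAG are dominated by the cheap side of the rate card.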
DeepInfra’s deployment of Qwen3.5 0.8B supports a 262k token context window alongside native Function Calling (Tool Use) and JSON Mode. A 262k context limit allows developers to pass hundreds of pages of documentation, extensive codebases, or long conversation histories in a single API request. Native function calling and JSON mode support enables the model to reliably trigger external APIs, return structured outputs, and interact with complex agentic workflows.
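A tool-calling request follows the OpenAI-style schema on this endpoint. In the sketch below, the `get_weather` tool and the model identifier are illustrative assumptions, not part of the DeepInfra API surface:

```python
# Sketch of an OpenAI-style tool-calling request body. The tool
# definition and model ID here are illustrative assumptions.
def tool_call_body(user_msg: str) -> dict:
    return {
        "model": "Qwen/Qwen3.5-0.8B",  # assumed model ID
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        # For structured output without tools, OpenAI-style JSON mode is:
        # "response_format": {"type": "json_object"},
    }
```

When the model decides to call the tool, the response carries a `tool_calls` entry with JSON arguments matching the declared parameter schema, which the client executes before returning the result in a follow-up message.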
DeepInfra (FP8) offers the lowest pricing at $0.01 per 1M input tokens and $0.05 per 1M output tokens, with a blended rate of $0.02 per 1M tokens.
On DeepInfra (FP8), the median TTFT is 0.37 seconds on a 10,000 input token workload, measured as P50 over 72 hours.
The model supports a 262,000-token (262k) context window, enabling extensive RAG use cases and processing of large documents or codebases.
DeepInfra's API provides native support for both function (tool) calling and JSON mode, making Qwen3.5 0.8B suitable for autonomous agent development.
DeepInfra (FP8) delivers 403.5 tokens per second, allowing a standard 500-token response to be generated in approximately 1.2 seconds.
The model is also available for local deployment under the Apache 2.0 license on Hugging Face and ModelScope. It can run on devices with as little as 2–3 GB of RAM using GGUF quantized formats via llama.cpp or Ollama.
For developers deploying Qwen3.5 0.8B (Reasoning), DeepInfra (FP8) is the way to go. It combines a sub-half-second TTFT (0.37s), high output throughput (403.5 t/s), and a blended price of just $0.02 per million tokens — delivering strong performance for both latency-sensitive and throughput-intensive production workloads, with native JSON mode and function calling support included.