Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost

We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Published on 2026.04.03 by DeepInfra

About Qwen3 Coder 480B A35B Instruct

Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters and 35 billion active parameters per inference, enabling high performance at lower computational cost compared to dense models of similar scale.

The model supports a native context length of 256K tokens (extendable to 1M via YaRN interpolation) and excels in agentic coding, browser-use, and tool-use tasks — achieving results comparable to Claude Sonnet 4. It was trained on 7.5 trillion high-quality tokens with a 70% code ratio across 358 programming languages, and its post-training phase leverages long-horizon reinforcement learning (Agent RL) to improve multi-step planning and interaction with external tools.

Qwen3 Coder 480B A35B Instruct is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Qwen3 Coder 480B A35B Instruct API Review Summary

DeepInfra (Turbo, FP4) is the overall recommended provider: #1 lowest blended price ($0.41/1M) and tied-#1 lowest latency (0.60s TTFT).
DeepInfra (FP8) offers a step up in throughput: 81.1 t/s vs 42 t/s for Turbo, at $0.70/1M blended — a balanced option for mixed workloads.
Eigen AI leads on throughput: 265.7 t/s and the fastest E2E time (3.20s for 500 tokens), but lacks Function Calling.
Google Vertex is the best-rounded high-speed option: 172.6 t/s, 0.69s TTFT, $0.61/1M, with full JSON Mode and Function Calling support.
Price spread: DeepInfra Turbo at $0.41/1M vs Alibaba Cloud at $3.00/1M — a 7.3x range across the 10 providers.
Feature coverage: Function Calling supported by 9 of 10 providers (Eigen AI is the exception); JSON mode by 7 of 10 (DeepInfra variants and Amazon Bedrock do not support it).

Qwen3 Coder 480B A35B Instruct — Best APIs

Provider	Best For	Blended ($/1M)	TTFT (s)	Speed (t/s)	E2E (s)	Context	Func	JSON
DeepInfra (Turbo, FP4)	Lowest cost + lowest latency; best value for interactive and cost-sensitive apps	$0.41	0.60s	42	12.59s	262k	Yes	No
DeepInfra (FP8)	Balanced low latency + mid price; faster throughput than Turbo	$0.70	0.87s	81.1	7.04s	262k	Yes	No
Google Vertex	Strong speed/latency balance with JSON mode support	$0.61	0.69s	172.6	3.58s	262k	Yes	Yes
Amazon Bedrock	Low price tier with solid throughput; lacks JSON mode	$0.61	1.82s	99.7	6.84s	262k	Yes	No
Eigen AI	Maximum throughput; fastest E2E time; lacks function calling	$0.61	1.32s	265.7	3.20s	262k	No	Yes

Quick Verdict: Which Qwen3 Coder 480B Provider is Best?

Based on benchmarks across 10 tracked providers, DeepInfra (Turbo, FP4) is the recommended API for production-scale Qwen3 Coder 480B deployment. It offers the lowest blended price ($0.41/1M), tied-lowest TTFT (0.60s), and full Function Calling support. For teams requiring higher throughput, DeepInfra (FP8) at $0.70/1M provides a strong step up to 81.1 t/s. For maximum raw speed with JSON mode, Google Vertex at $0.61/1M delivers 172.6 t/s and 0.69s TTFT.

Overall Winner: DeepInfra (Turbo, FP4)

DeepInfra (Turbo, FP4) offers the strongest overall balance of latency, cost, and feature support for Qwen3 Coder 480B deployments.

Blended Price: $0.41 / 1M tokens (#1 lowest)
Input Price: $0.22 / 1M tokens (tied lowest)
Output Price: $1.00 / 1M tokens (#1 lowest)
Latency (TTFT): 0.60s (tied #1)
Output Speed: 42 t/s
End-to-End (500 tokens): 12.59s
Context Window: 262k tokens
API Features: Function Calling supported; JSON mode not currently available

The Turbo FP4 variant’s near-instantaneous response (0.60s TTFT) makes it ideal for interactive coding assistants and real-time agentic workflows. Its pricing undercuts the next cheapest option (Novita at $0.55) by 25%, making it the strongest choice for cost-sensitive production workloads. The trade-off is lower throughput (42 t/s) and no JSON mode — developers requiring structured outputs should opt for DeepInfra FP8 or Google Vertex.

Balanced Alternative: DeepInfra (FP8)

DeepInfra (FP8) provides a meaningful throughput upgrade over the Turbo variant at a moderate price increase.

Blended Price: $0.70 / 1M tokens
Latency (TTFT): 0.87s
Output Speed: 81.1 t/s (~2x faster than Turbo FP4)
End-to-End (500 tokens): 7.04s
API Features: Function Calling supported; JSON mode not currently available

At $0.70/1M, DeepInfra FP8 sits in the middle of the pricing range while delivering nearly double the throughput of the Turbo variant. Its 7.04s E2E time is considerably faster than the Turbo’s 12.59s, making it the better choice for workloads that mix interactive use with moderate content generation volume.

Best High-Speed Option with Full Features: Google Vertex

Google Vertex offers the best combination of speed, latency, and full API feature support among the competitive mid-price tier.

Blended Price: $0.61 / 1M tokens
Latency (TTFT): 0.69s (#3 lowest)
Output Speed: 172.6 t/s (#2 overall)
End-to-End (500 tokens): 3.58s (#2 fastest)
API Features: JSON Mode + Function Calling

Google Vertex is the only provider in the $0.61 price tier that combines low latency (0.69s), high throughput (172.6 t/s), and full feature support including JSON mode. For teams requiring structured output alongside fast generation, it is the strongest alternative to DeepInfra.

Maximum Throughput: Eigen AI

Eigen AI leads the benchmark on raw generation speed, delivering the fastest E2E response times of any provider.

Blended Price: $0.61 / 1M tokens
Latency (TTFT): 1.32s
Output Speed: 265.7 t/s (#1 — fastest in the benchmark)
End-to-End (500 tokens): 3.20s (#1 fastest)
API Features: JSON Mode supported; Function Calling not supported

Eigen AI’s 265.7 t/s throughput makes it the fastest provider for bulk code generation and long-form content. However, the absence of Function Calling makes it unsuitable for agentic workflows where the model needs to invoke external tools. It is best suited for high-volume batch generation where tool use is not required.

Enterprise Platform: Amazon Bedrock

Blended Price: $0.61 / 1M tokens
Input Price: $0.22 / 1M tokens (tied lowest)
Latency (TTFT): 1.82s
Output Speed: 99.7 t/s
API Features: Function Calling supported; JSON mode not supported

Amazon Bedrock offers competitive pricing (tied lowest input at $0.22/1M) and solid throughput (99.7 t/s), but its 1.82s TTFT is one of the higher latency figures in the benchmark and it lacks JSON mode. It is recommended only when strict AWS IAM or compliance requirements make it the necessary choice.

Conclusion

For Qwen3 Coder 480B A35B Instruct deployments, DeepInfra is the recommended provider across both its available variants — with the right choice depending on your workload priorities.

Choose DeepInfra (Turbo, FP4) for the best overall value — lowest cost ($0.41/1M), lowest latency (0.60s), and function calling support.
Choose DeepInfra (FP8) for a stronger throughput profile (81.1 t/s, 7.04s E2E) at a moderate $0.70/1M price point.
Choose Google Vertex for the best combination of speed (172.6 t/s), latency (0.69s), and full JSON Mode + Function Calling support.
Choose Eigen AI for maximum raw throughput (265.7 t/s) in batch workloads where function calling is not required.
Avoid Alibaba Cloud for most use cases — at $3.00/1M it is 7x more expensive than DeepInfra Turbo with no performance advantage.

Introducing Tool Calling with LangChain, Search the Web with Tavily and Tool Calling AgentsIn this blog post, we will query for the details of a recently released expansion pack for Elden Ring, a critically acclaimed game released in 2022, using the Tavily tool with the ChatDeepInfra model. Using this boilerplate, one can automate the process of searching for information with well-writt...

Gemma 4 Model Overview: Features, Architecture & Use Cases<p>Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of […]</p>

Fork of Text Generation Inference.The text generation inference open source project by huggingface looked like a promising framework for serving large language models (LLM). However, huggingface announced that they will change the license of code with version v1.0.0. While the previous license Apache 2.0 was permissive, the new on...

View all