GLM-5 API Benchmarks: Latency, Throughput & Cost

We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Published on 2026.04.03 by DeepInfra

GLM-5 is the latest open-weights reasoning model released by Z AI (Zhipu AI) in February 2026, characterized by high “thinking token” usage. It is a Mixture of Experts (MoE) model with 744B total parameters and 40B active parameters, scaling up from GLM-4.5’s 355B parameters. The model was pre-trained on 28.5T tokens and features a 200K+ context window.

GLM-5 is purpose-built for complex systems engineering and long-horizon agentic tasks. It integrates DeepSeek Sparse Attention (DSA), substantially reducing deployment costs while preserving long-context capacity. The model was trained using a novel asynchronous RL infrastructure called “Slime” that improves training throughput and efficiency.

GLM-5 achieves best-in-class performance among all open-source models on reasoning, coding, and agentic tasks, closing the gap with frontier models like Claude Opus 4.5. On SWE-bench Verified and Terminal-Bench 2.0, GLM-5 records leading open-model scores of 77.8 and 56.2, respectively.

GLM-5 is now available across multiple inference providers, but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

GLM-5 (Reasoning) API Review Summary

Most cost-effective provider (blended): DeepInfra FP8 at $1.24 / 1M tokens (3:1 input:output blend), cheaper than the next best Baseten ($1.50).
Lowest raw token rates: DeepInfra FP8 has the lowest input price ($0.80 / 1M) and lowest output price ($2.56 / 1M) among listed providers.
Competitive responsiveness for the price: DeepInfra FP8 posts 1.22s time-to-first-token (top-5 latency group) with 57 tokens/sec output speed.
Production-oriented benchmarking: Metrics shown are median (P50) over the past 72 hours, using a 10,000 input token workload to reflect real production usage.
API feature readiness: All providers support function calling, and DeepInfra FP8 supports JSON mode (JSON mode supported by 12 of 13 providers).

GLM-5 (Reasoning) — Best APIs

Provider	Best For	Blended ($/1M)	Input ($/1M)	Output ($/1M)	Speed (t/s)	Latency (TTFT)	JSON	Func Calling	Why Notable
DeepInfra FP8 (deepinfra.com)	Lowest cost / best cost baseline	$1.24	$0.80	$2.56	57	1.22s	Yes	Yes	Cheapest blended + input/output; competitive latency for interactive apps at budget.
Fireworks	Max throughput + lowest latency	$1.55	—	—	212.2	0.74s	Yes	Yes	Best raw performance (fastest output + lowest TTFT) but higher blended price.
Baseten	Strong balance (fast + low cost)	$1.50	$0.95	$3.15	183.1	0.83s	Yes	Yes	Near-top performance with 2nd-lowest blended price.
Eigen AI	High speed alternative	$1.55	—	—	204.5	1.53s	Yes	Yes	Very high output speed; blended price higher than DeepInfra.
Novita FP8	Price runner-up	$1.55	$1.00	$3.20	50	1.45s	Yes	Yes	Among lowest blended prices, but still above DeepInfra on all metrics.

Quick Verdict: Which GLM-5 Provider is Best?

Based on benchmarks of all the aforementioned providers, DeepInfra is the recommended API for production-scale GLM-5 deployment. It offers the lowest blended price on the market ($1.24 per 1M tokens), which is critical for reasoning models that generate high volumes of internal chain-of-thought tokens. For use cases requiring sub-second latency, Fireworks is the fastest provider tested.

Use Case	Recommended Provider	Why?
Chatbots / Real-time	Fireworks	Lowest Latency (0.74s)
Batch Processing / RAG	DeepInfra	Lowest Cost ($1.24/1M)
Agentic Workflows	Baseten	Balanced Speed & Cost

What is the Best GLM-5 API for Production?

While other providers chase raw burst speed, DeepInfra (FP8) secures the top recommendation by dominating the most critical metric for scaling reasoning models: Cost Efficiency.

Reasoning models like GLM-5 generate a high volume of “thinking” tokens before producing a final answer. This drastically inflates output token usage compared to standard LLMs. DeepInfra offers the lowest blended price and, crucially, the most competitive output pricing, making it the optimal choice for high-volume production environments where margins matter.

Blended Price: $1.24 per 1M tokens (Lowest on the market)
Input Price: $0.80 per 1M
Output Price: $2.56 per 1M
Latency (TTFT): 1.22s
Context Window: 203k

DeepInfra provides a competitive latency profile (tying for 4th place) while undercutting the market standard price ($1.55) by 20%. For applications requiring heavy reasoning chains, the savings on output tokens make DeepInfra the superior architectural choice.

Which GLM-5 Provider is the Fastest?

For applications requiring massive batch processing or long-form content generation where the user is waiting for the stream to finish, Output Tokens Per Second (t/s) is the governing metric.

Fireworks: The undisputed performance leader, clocking in at 212.2 t/s. Fireworks is optimized for sheer velocity — approximately 7x faster than the slowest providers in the benchmark.
Eigen AI: A close second at 204.5 t/s.
Baseten: Consistently high performance at 183.1 t/s.

Technical Note: The gap between the top tier (Fireworks/Eigen) and the mid-tier (Google at 75.7 t/s) is substantial. If your application relies on rapid text generation (e.g., code autocompletion), the premium for Fireworks or Eigen is justified.

Which GLM-5 Provider Has the Lowest Latency?

TTFT is the critical metric for conversational AI and chatbots. It measures the time between the request and the first visible character.

Fireworks: Leads the pack with 0.74s. This sub-second response time creates a real-time feeling for end-users.
Baseten: Excellent performance at 0.83s.
FriendliAI: The only other provider to break the sub-second barrier at 0.94s.

DeepInfra and Together.ai (FP4) sit in the second tier at 1.22s. While slower than Fireworks, 1.2 seconds is generally acceptable for most asynchronous reasoning tasks where the user expects a brief thinking pause.

Cheapest GLM-5 API Providers and Cost Analysis

Winner: DeepInfra (FP8)

While speed is important, the cost structure is the deciding factor for reasoning models. GLM-5 generates significant “thinking” tokens, inflating output costs.

DeepInfra: $1.24 / 1M tokens (Cheapest)
Baseten: $1.50 / 1M tokens
Market Average: $1.55 / 1M tokens

DeepInfra undercuts the market average by 20%. For high-volume reasoning chains, this price difference is the primary differentiator. The market has largely coalesced around a standard blended price of $1.55 per 1 million tokens. Deviations from this norm highlight the value leaders:

The Budget Leader: DeepInfra ($1.24) — the lowest price point available.
The Value Runner-Up: Baseten ($1.50) — a compelling middle ground, slightly cheaper than the standard rate but significantly faster (183 t/s) than budget options.
The Standard Tier ($1.55): Novita, Eigen AI, Nebius, FriendliAI, GMI, Fireworks, Parasail, Google, and Together.ai all cluster at this price point.

Feature Support: JSON Mode & Tool Calling

Technical integration is just as important as raw speed.

Function Calling: Support is ubiquitous. All 13 providers — including Novita, Nebius, and SiliconFlow — support Function/Tool calling, ensuring GLM-5 can be used as an agentic backend across any provider.
JSON Mode: 12 of the 13 providers support JSON mode, ensuring deterministic structured outputs. SiliconFlow is the exception — developers requiring strict JSON schemas should verify support before deployment, or choose a fully compliant provider like DeepInfra or Fireworks.

The Metrics at a Glance

Provider	Blended Price (/1M)	Latency (TTFT)	Output Speed (t/s)	Recommendation
DeepInfra FP8	$1.24	1.22s	~57	Best Overall (Value)
Fireworks	$1.55	0.74s	212.2	Best Performance
Baseten	$1.50	0.83s	183.1	Best Balanced
Eigen AI	$1.55	1.53s	204.5	High Throughput
FriendliAI	$1.55	0.94s	73.9	Low Latency
Google	$1.55	1.33s	75.7	Standard

Conclusion

For developers integrating GLM-5 (Reasoning) into their stack, the optimal choice depends on the specific bottleneck of the application.

If you are building a consumer-facing chatbot where every millisecond counts, Fireworks is the technically superior option. However, for the vast majority of enterprise use cases — where reasoning models are used for complex data processing, RAG pipelines, or agentic workflows — DeepInfra is the recommended provider. It offers a robust feature set (JSON mode and Function calling), acceptable latency, and a pricing structure that provides a significant long-term competitive advantage.

A short intro on running Stable Diffusion on DeepInfraI'm glad you asked

OpenClaw Cost Optimization: Cut AI API Costs by 90%<p>A single ask in an OpenClaw session can cost more than a full evening of casual ChatGPT use. Ask your agent something simple, like which calendar event clashes with your flight, and the request that hits the API carries far more than your 12-token question. It also carries your SOUL.md, the tool schemas registered on […]</p>

DeepInfra is now a supported Hugging Face Inference ProviderDeepInfra is officially live as an Inference Provider on the Hugging Face Hub. You can now call DeepInfra-hosted models directly from Hugging Face model pages, through our OpenAI-compatible router (use it with any OpenAI SDK), or via the Hugging Face SDKs in Python and JavaScript.

View all