GLM-5.1 Model Overview: Features, Capabilities & Use Cases

Published on 2026.05.25 by DeepInfra

GLM-5.1 is Z.AI’s next-generation flagship model for agentic engineering, released on April 7, 2026 under the MIT license. It is a 754-billion-parameter Mixture-of-Experts model with 40 billion active parameters per token, a 202,752-token context window, and up to 131K output tokens. As the direct successor to GLM-5, it is designed for long-horizon autonomous tasks: not just single-turn completions, but sustained, iterative workflows spanning hundreds of rounds and thousands of tool calls. The weights are published on Hugging Face, and the model is served on DeepInfra at deepinfra.com/zai-org/GLM-5.1.
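To put the headline figures in proportion, here is a small arithmetic sketch. It rests on two assumptions that the overview does not state: that "131K" means 131,072 tokens, and that prompt and output tokens share the same context window. Treat the results as rough orientation rather than documented limits.

```python
# Figures quoted in the overview above.
TOTAL_PARAMS_B = 754       # total parameters, in billions
ACTIVE_PARAMS_B = 40       # active parameters per token, in billions
CONTEXT_TOKENS = 202_752   # context window, in tokens

# Assumption: "131K" is read as 131,072 (128 * 1024); the article does not spell it out.
MAX_OUTPUT_TOKENS = 131_072


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the parameters that are active for any single token."""
    return active_b / total_b


def prompt_budget(context_tokens: int, reserved_output_tokens: int) -> int:
    """Tokens left for the prompt if the full output budget is reserved.

    Assumption: prompt and output count against one shared context window.
    """
    return context_tokens - reserved_output_tokens


print(f"Active share per token: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"Prompt budget at max output: {prompt_budget(CONTEXT_TOKENS, MAX_OUTPUT_TOKENS):,} tokens")
```

Under these assumptions, roughly one parameter in nineteen is active per token, and a request that reserves the full output budget leaves about 72K tokens for the prompt.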

Long-Horizon Agentic Performance

The core design principle of GLM-5.1 is endurance: the ability to keep improving across extended autonomous runs rather than plateauing after initial gains. Previous models — including GLM-5 — tend to exhaust their strategy early and stall. GLM-5.1 is built to keep revising, running experiments, reading results, and identifying blockers across the full length of a task.

Z.AI demonstrated this with two concrete examples. In one, GLM-5.1 built a complete Linux desktop environment autonomously over an 8-hour session, running 655 iterations of planning, execution, testing, and optimization. In another, it increased vector database query throughput to 6.9x the initial production baseline through sustained iterative improvement. These are not single-pass results — they reflect a model designed to improve the longer it runs.

Performance and Benchmarks

GLM-5.1 leads on SWE-Bench Pro (58.4), ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3), and posts the largest improvement over its predecessor on CyberGym (+20.4 points). On general reasoning benchmarks, it is competitive but not the leader — GPT-5.4 and Gemini 3.1 Pro lead on AIME 2026 and GPQA-Diamond. The model is clearly tuned for coding and agentic execution rather than pure mathematical reasoning.

Asterisked (*) competitor scores on HLE with tools were not available from official sources and were re-evaluated by Z.AI under the same conditions used for GLM-5.1.

Benchmark	GLM-5.1	GLM-5	Qwen3.6-Plus	DeepSeek-V3.2	Claude Opus 4.6	GPT-5.4
HLE (no tools)	31.0	30.5	28.8	25.1	36.7	39.8
HLE (w/ tools)	52.3	50.4	50.6	40.8	53.1*	52.1*
AIME 2026	95.3	95.4	95.1	95.1	98.2	98.7
GPQA-Diamond	86.2	86.0	90.4	82.4	94.3	92.0
SWE-Bench Pro	58.4	55.1	56.6	—	54.2	57.7
NL2Repo	42.7	35.9	37.9	—	49.8	41.3
Terminal-Bench 2.0	63.5	56.2	61.6	39.3	68.5	—
τ³-Bench	70.6	69.2	70.7	69.2	67.1	72.9

Additional results not in the table: CyberGym 68.7 (up from GLM-5’s 48.3, ahead of Claude Opus 4.6’s 66.6); BrowseComp 68.0 standard / 79.3 with Context Management enabled; MCP-Atlas Public 71.8; Vending Bench 2 $5,634.
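The generation-over-generation gains can be read directly from the figures above. The sketch below copies only the GLM-5.1 and GLM-5 scores from the table, plus the CyberGym pair, and ranks the differences. Because the benchmarks use different scales, the point differences are a rough guide and not a like-for-like comparison.

```python
# (GLM-5.1, GLM-5) scores copied from the table and the CyberGym figures above.
SCORES = {
    "HLE (no tools)": (31.0, 30.5),
    "HLE (w/ tools)": (52.3, 50.4),
    "AIME 2026": (95.3, 95.4),
    "GPQA-Diamond": (86.2, 86.0),
    "SWE-Bench Pro": (58.4, 55.1),
    "NL2Repo": (42.7, 35.9),
    "Terminal-Bench 2.0": (63.5, 56.2),
    "τ³-Bench": (70.6, 69.2),
    "CyberGym": (68.7, 48.3),
}


def deltas(scores: dict) -> dict:
    """Point difference (GLM-5.1 minus GLM-5), rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


# Sort from largest gain to largest regression.
ranked = sorted(deltas(SCORES).items(), key=lambda item: item[1], reverse=True)

for name, delta in ranked:
    print(f"{name:<20}{delta:+.1f}")
```

The ranking puts CyberGym first at +20.4 points, followed by Terminal-Bench 2.0 and NL2Repo. AIME 2026 is the only benchmark in the list where GLM-5.1 scores slightly below its predecessor.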

Getting Started with the API

GLM-5.1 is available on DeepInfra via an OpenAI-compatible API. No infrastructure setup is required: swap in the model identifier and your DeepInfra token.

Base URL: https://api.deepinfra.com/v1/openai
Model identifier: zai-org/GLM-5.1
Authentication: Authorization: Bearer YOUR_DEEPINFRA_API_KEY
Supports: JSON mode, function calling, streaming, multi-turn conversations

Example request:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.1",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
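The same request can be sketched in Python using only the standard library. The base URL, model identifier, and `DEEPINFRA_TOKEN` variable come from the text above. The helper names `build_chat_request` and `send` are illustrative, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape, which the article implies but does not show.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/openai"
MODEL = "zai-org/GLM-5.1"


def build_chat_request(prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request mirroring the curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumption: the response follows the OpenAI chat-completions shape.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    token = os.environ.get("DEEPINFRA_TOKEN")
    if token:  # only call the API when a token is configured
        print(send(build_chat_request("Hello!", token)))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any tokens are spent.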

The full API reference is available at deepinfra.com/zai-org/GLM-5.1/api.
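For the function calling and JSON mode support listed above, the request bodies might look like the sketch below. It assumes the endpoint accepts the OpenAI-style `tools`, `tool_choice`, and `response_format` fields. The `run_tests` tool and its `path` parameter are hypothetical and exist only for illustration, so check the API reference for the fields the endpoint actually accepts.

```python
import json

MODEL = "zai-org/GLM-5.1"


def build_tool_payload(user_message: str) -> dict:
    """Chat payload that offers the model one callable tool."""
    # Hypothetical tool: not part of any real API, shown only to illustrate the schema.
    run_tests_tool = {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the project's test suite and report failures.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory to test."},
                },
                "required": ["path"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [run_tests_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


def build_json_mode_payload(user_message: str) -> dict:
    """Chat payload that requests a JSON object as the reply."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "response_format": {"type": "json_object"},
    }


print(json.dumps(build_tool_payload("Fix the failing tests in ./src"), indent=2))
```

Either dictionary can be serialized and posted to the chat completions endpoint in place of the minimal payload from the example request.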

Pricing

GLM-5.1 on DeepInfra uses usage-based pricing calculated per 1 million tokens:

Token Type	Price per 1M Tokens
Input Tokens	$1.05
Output Tokens	$3.50
Cached Input Tokens	$0.205

The cached input rate of $0.205/1M tokens is particularly relevant for agentic workloads that repeatedly send stable prefixes — system prompts, tool schemas, repo context, or persistent agent instructions. For a detailed breakdown of how token economics play out across providers and workload types, see the GLM-5.1 pricing guide.
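To see how the cached rate changes the bill, here is a rough cost sketch using the three prices from the table. The workload is hypothetical: 200 agent rounds, each sending a 50,000-token prompt (40,000 of which is a stable prefix) and producing 1,000 output tokens. It also assumes the prefix is billed at the cached rate on every round, which overstates the savings slightly since the first round cannot hit the cache.

```python
# Prices per 1M tokens, copied from the pricing table above (USD).
INPUT_PRICE = 1.05
OUTPUT_PRICE = 3.50
CACHED_INPUT_PRICE = 0.205


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost in USD of one request; cached_tokens is the cached part of input_tokens."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * INPUT_PRICE
        + cached_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / 1_000_000


def run_cost(rounds: int, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost of an agent run in which every round has the same token counts."""
    return rounds * request_cost(input_tokens, output_tokens, cached_tokens)


# Hypothetical workload, for illustration only.
uncached = run_cost(rounds=200, input_tokens=50_000, output_tokens=1_000)
cached = run_cost(rounds=200, input_tokens=50_000, output_tokens=1_000, cached_tokens=40_000)

print(f"Without caching: ${uncached:.2f}")
print(f"With cached prefix: ${cached:.2f}")
```

For this illustrative workload the run costs about $11.20 without caching and about $4.44 with the prefix cached, so the stable prefix dominates the bill whenever prompts are long and outputs are short.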

For current pricing, visit the DeepInfra pricing page. Private endpoint deployment is also available for teams that need dedicated capacity.

Self-Hosting

Because of its 754B-parameter scale, self-hosting GLM-5.1 requires significant hardware: at minimum one NVIDIA HGX B200 node (8x B200 GPUs) at full precision. The FP8 quantized checkpoint (zai-org/GLM-5.1-FP8) is the recommended serving target, reducing memory requirements to approximately 860GB while preserving output quality for production workloads. Supported inference engines are vLLM (v0.19.0+) and SGLang (v0.5.10+). The MIT license permits commercial deployment and fine-tuning.
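A back-of-the-envelope estimate shows why the FP8 checkpoint matters. The sketch below counts weight storage only and assumes that "full precision" means 2 bytes per parameter (BF16) and that FP8 uses 1 byte per parameter. The article states neither, and the estimate ignores KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS_B = 754        # total parameters in billions, from the article
ARTICLE_FP8_TOTAL_GB = 860  # approximate FP8 memory requirement quoted in the article

# Assumptions: bytes per parameter for each format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params_billion * bytes_per_param


bf16_gb = weight_memory_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM["bf16"])
fp8_gb = weight_memory_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM["fp8"])

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Gap to the quoted FP8 figure: ~{ARTICLE_FP8_TOTAL_GB - fp8_gb:.0f} GB")
```

Under these assumptions the weights alone take roughly 1,508 GB in BF16 and 754 GB in FP8. The article does not break down the remaining gap of about 106 GB to its 860GB figure.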

Conclusion

GLM-5.1 is the strongest open-weight choice for developers building long-horizon autonomous coding agents. Its benchmark results on SWE-Bench Pro, CyberGym, and NL2Repo reflect a model that was deliberately tuned for the kind of iterative, multi-step engineering work that most coding models struggle to sustain. The trade-offs are real: GPT-5.4 and Gemini 3.1 Pro lead on pure reasoning benchmarks, and the context window (203K) is smaller than some proprietary alternatives. For agentic coding workflows, though, the combination of open weights, MIT licensing, and sustained long-run performance makes GLM-5.1 a credible choice.

Visit deepinfra.com/zai-org/GLM-5.1 to try the demo, review API documentation, or grab your API key and start building. For context on how GLM-5.1 compares against other models in the GLM family, the GLM-5 API benchmarks and GLM-4.6 vs DeepSeek-V3.2 comparison are useful reference points.
