GLM-5.1 Model Overview: Features, Capabilities & Use Cases
Published on 2026.05.25 by DeepInfra

GLM-5.1 is Z.AI’s next-generation flagship model for agentic engineering, released on April 7, 2026 under the MIT license. It is a 754-billion-parameter Mixture-of-Experts model with 40 billion active parameters per token, a 202,752-token context window, and up to 131K output tokens. The model is the direct successor to GLM-5 and is designed specifically for long-horizon autonomous tasks: not just single-turn completions, but sustained, iterative workflows across hundreds of rounds and thousands of tool calls. Weights are available on Hugging Face, and the model is served on DeepInfra at deepinfra.com/zai-org/GLM-5.1.

Long-Horizon Agentic Performance

The core design principle of GLM-5.1 is endurance: the ability to keep improving across extended autonomous runs rather than plateauing after initial gains. Previous models — including GLM-5 — tend to exhaust their strategy early and stall. GLM-5.1 is built to keep revising, running experiments, reading results, and identifying blockers across the full length of a task.

Z.AI demonstrated this with two concrete examples. In one, GLM-5.1 built a complete Linux desktop environment autonomously over an 8-hour session, running 655 iterations of planning, execution, testing, and optimization. In another, it increased vector database query throughput to 6.9x the initial production baseline through sustained iterative improvement. These are not single-pass results — they reflect a model designed to improve the longer it runs.

Performance and Benchmarks

GLM-5.1 leads on SWE-Bench Pro (58.4), ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3), and posts the largest improvement over its predecessor on CyberGym (+20.4 points). On general reasoning benchmarks, it is competitive but not the leader — GPT-5.4 and Gemini 3.1 Pro lead on AIME 2026 and GPQA-Diamond. The model is clearly tuned for coding and agentic execution rather than pure mathematical reasoning.

Asterisked (*) competitor scores on HLE with tools were not available from official sources and were re-evaluated by Z.AI under the same conditions used for GLM-5.1.

Benchmark | GLM-5.1 | GLM-5 | Qwen3.6-Plus | DeepSeek-V3.2 | Claude Opus 4.6 | GPT-5.4
HLE (no tools) | 31.0 | 30.5 | 28.8 | 25.1 | 36.7 | 39.8
HLE (w/ tools) | 52.3 | 50.4 | 50.6 | 40.8 | 53.1* | 52.1*
AIME 2026 | 95.3 | 95.4 | 95.1 | 95.1 | 98.2 | 98.7
GPQA-Diamond | 86.2 | 86.0 | 90.4 | 82.4 | 94.3 | 92.0
SWE-Bench Pro | 58.4 | 55.1 | 56.6 | 54.2 | 57.7
NL2Repo | 42.7 | 35.9 | 37.9 | 49.8 | 41.3
Terminal-Bench 2.0 | 63.5 | 56.2 | 61.6 | 39.3 | 68.5
τ³-Bench | 70.6 | 69.2 | 70.7 | 69.2 | 67.1 | 72.9

Additional results not in the table: CyberGym 68.7 (up from GLM-5’s 48.3, ahead of Claude Opus 4.6’s 66.6); BrowseComp 68.0 standard / 79.3 with Context Management enabled; MCP-Atlas Public 71.8; Vending Bench 2 $5,634.
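The "+20.4 points" CyberGym figure can be checked against the scores quoted in this article. The sketch below is only illustrative: it uses a subset of the benchmarks for which both a GLM-5.1 and a GLM-5 score are given, and the `scores` variable and `largest_gain` function are hypothetical names introduced here.

```shell
# Columns: benchmark|GLM-5.1|GLM-5 (scores copied from this article).
scores='HLE (no tools)|31.0|30.5
HLE (w/ tools)|52.3|50.4
AIME 2026|95.3|95.4
GPQA-Diamond|86.2|86.0
τ³-Bench|70.6|69.2
CyberGym|68.7|48.3'

# Print the GLM-5.1 minus GLM-5 delta per benchmark, then the largest delta.
largest_gain() {
  printf '%s\n' "$scores" | awk -F'|' '
    {
      delta = $2 - $3
      printf "%s: %.1f\n", $1, delta
      if (NR == 1 || delta > best) { best = delta; name = $1 }
    }
    END { printf "largest gain: %s (%.1f)\n", name, best }'
}

largest_gain
```

On these inputs the last line reads `largest gain: CyberGym (20.4)`, while the reasoning benchmarks move by less than two points in either direction.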

Getting Started with the API

GLM-5.1 is available on DeepInfra via an OpenAI-compatible API. No infrastructure setup is required: swap in the model identifier and your DeepInfra token.

  • Base URL: https://api.deepinfra.com/v1/openai
  • Model identifier: zai-org/GLM-5.1
  • Authentication: Authorization: Bearer YOUR_DEEPINFRA_API_KEY
  • Supports: JSON mode, function calling, streaming, multi-turn conversations

Example request:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.1",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'

The full API reference is available at deepinfra.com/zai-org/GLM-5.1/api.
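Streaming is listed among the supported features above. The following is a minimal sketch of a streaming request, assuming the endpoint accepts the OpenAI-style `"stream": true` field and returns the response as server-sent events. `build_payload` and `stream_chat` are hypothetical helper names, and the prompt is assumed to contain no characters that need JSON escaping.

```shell
# Build a chat-completions payload with streaming enabled.
# Assumption: the prompt contains no quotes or backslashes.
build_payload() {
  cat <<EOF
{
  "model": "zai-org/GLM-5.1",
  "stream": true,
  "messages": [{"role": "user", "content": "$1"}]
}
EOF
}

# Send the payload; -N turns off curl's output buffering so chunks
# are printed as they arrive. Requires DEEPINFRA_TOKEN to be set.
stream_chat() {
  curl -N "https://api.deepinfra.com/v1/openai/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -d "$(build_payload "$1")"
}

# Example call (commented out so the sketch has no side effects):
# stream_chat "Hello!"
```

Keeping the payload builder separate from the network call makes it easy to inspect the JSON before sending it.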

Pricing

GLM-5.1 on DeepInfra uses usage-based pricing calculated per 1 million tokens:

Token Type | Price per 1M Tokens
Input Tokens | $1.05
Output Tokens | $3.50
Cached Input Tokens | $0.205

The cached input rate of $0.205/1M tokens is particularly relevant for agentic workloads that repeatedly send stable prefixes — system prompts, tool schemas, repo context, or persistent agent instructions. For a detailed breakdown of how token economics play out across providers and workload types, see the GLM-5.1 pricing guide.
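To make the effect of the cached rate concrete, here is a small cost sketch using the three prices listed above. The token counts are hypothetical (one agent turn with a 200,000-token prompt and 4,000 output tokens), and `estimate_cost` is an illustrative helper, not part of any DeepInfra tooling.

```shell
# Prices in USD per 1M tokens, as listed in the pricing table.
PRICE_INPUT=1.05
PRICE_OUTPUT=3.50
PRICE_CACHED=0.205

# estimate_cost FRESH_INPUT CACHED_INPUT OUTPUT -> cost in USD
estimate_cost() {
  awk -v fresh="$1" -v cached="$2" -v out="$3" \
      -v p_in="$PRICE_INPUT" -v p_cached="$PRICE_CACHED" -v p_out="$PRICE_OUTPUT" \
      'BEGIN { printf "%.4f\n", (fresh * p_in + cached * p_cached + out * p_out) / 1000000 }'
}

estimate_cost 200000 0 4000       # no cache hits: prints 0.2240
estimate_cost 20000 180000 4000   # 180,000 prompt tokens cached: prints 0.0719
```

Under these assumed numbers, serving 90% of the prompt from cache cuts the cost of the turn by roughly two-thirds. Actual savings depend on how much of each request is a stable prefix.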

For current pricing, visit the DeepInfra pricing page. Private endpoint deployment is also available for teams that need dedicated capacity.

Self-Hosting

Due to the 754B parameter scale, self-hosting GLM-5.1 requires significant hardware: a minimum of 1x NVIDIA HGX B200 (8x B200 GPUs) at full precision. The FP8 quantized checkpoint (zai-org/GLM-5.1-FP8) is the recommended serving target, reducing memory requirements to approximately 860GB while preserving output quality for production workloads. Supported inference engines are vLLM (v0.19.0+) and SGLang (v0.5.10+). The MIT license permits commercial deployment and fine-tuning; its only condition is that the copyright and license notice be retained.
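A rough, weight-only estimate shows why the FP8 checkpoint matters. This sketch multiplies the parameter count by bytes per parameter, assuming that "full precision" here means 16-bit weights (2 bytes per parameter) and that FP8 uses 1 byte per parameter. It ignores KV cache, activations, and any tensors kept at higher precision. `weights_gb` is a hypothetical helper name.

```shell
PARAMS_BILLIONS=754   # total parameter count quoted in this article

# weights_gb BYTES_PER_PARAM -> approximate weight size in GB (10^9 bytes)
weights_gb() {
  awk -v p="$PARAMS_BILLIONS" -v b="$1" 'BEGIN { printf "%d\n", p * b }'
}

weights_gb 2   # 16-bit weights (assumed meaning of full precision): prints 1508
weights_gb 1   # FP8 weights: prints 754
```

The FP8 estimate of 754 GB is below the approximately 860GB quoted above, which is what one would expect if that figure also covers memory beyond the raw quantized weights. The exact breakdown is not stated here.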

Conclusion

GLM-5.1 is the strongest open-weight choice for developers building long-horizon autonomous coding agents. Its benchmark results on SWE-Bench Pro, CyberGym, and NL2Repo reflect a model that was deliberately tuned for the kind of iterative, multi-step engineering work that most coding models struggle to sustain. The trade-offs are real: GPT-5.4 and Gemini 3.1 Pro lead on pure reasoning benchmarks, and the context window (203K) is smaller than some proprietary alternatives. For agentic coding workflows, though, the combination of open weights, MIT licensing, and sustained long-run performance makes it a credible choice.

Visit deepinfra.com/zai-org/GLM-5.1 to try the demo, review API documentation, or grab your API key and start building. For context on how GLM-5.1 compares against other models in the GLM family, the GLM-5 API benchmarks and GLM-4.6 vs DeepSeek-V3.2 comparison are useful reference points.
