Introducing GLM-5.2 on DeepInfra

Published on 2026.07.01 by DeepInfra

GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks, such as degraded retrieval and inconsistent behavior at long range. Z-AI describes this as the first time that scale has been delivered reliably for sustained, long-horizon work. The coding improvements over its predecessor make the same point in numbers: DeepSWE goes from 18.0 on GLM-5.1 to 46.2 on GLM-5.2.

The architecture behind that context window is new. GLM-5.2 introduces IndexShare, which reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — meaning the longer window doesn’t come with the proportional compute cost you’d expect. The model also ships under an MIT license with no regional restrictions, which puts it in a different category from most models competing at this benchmark tier. It’s now available on DeepInfra under zai-org/GLM-5.2.

What Makes This Model Different

GLM-5.2 is Z-AI’s follow-up to GLM-5.1, and the headline upgrade is a stable 1,048,576-token (1M) context window. Long context support isn’t new, but reliable performance at that scale for long-horizon tasks is harder to deliver than it sounds. The key enabler here is IndexShare, a new architectural design that reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — a meaningful reduction that makes running very long contexts practical rather than theoretical.
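
In practice, a 1M window still needs budgeting when prompts and reserved output are both large. The sketch below checks whether a request fits. It assumes that prompt and completion tokens share the same 1,048,576-token window, which is the usual convention but is not confirmed here for GLM-5.2, and the function names are illustrative.

```python
CONTEXT_WINDOW = 1_048_576  # tokens, as stated for GLM-5.2


def remaining_budget(prompt_tokens, max_output_tokens):
    """Tokens left in the window after the prompt and reserved output.

    Assumption: prompt and completion share one window. This is the usual
    convention but is not confirmed in this post for GLM-5.2.
    """
    return CONTEXT_WINDOW - prompt_tokens - max_output_tokens


def fits(prompt_tokens, max_output_tokens):
    """True when the request stays within the context window."""
    return remaining_budget(prompt_tokens, max_output_tokens) >= 0


print(fits(900_000, 32_000))    # 932,000 tokens requested
print(fits(1_040_000, 16_000))  # 1,056,000 tokens requested
```

The first request fits with room to spare; the second exceeds the window by 7,424 tokens and would need a shorter prompt or a smaller output reservation.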

On the inference side, GLM-5.2 ships with an improved multi-token prediction (MTP) layer that increases speculative decoding acceptance length by up to 20% over GLM-5.1. Combined with flexible thinking effort levels for coding tasks — letting you trade latency for quality depending on the task — the model gives developers real levers to tune behavior. For a detailed breakdown of how GLM-5.1 approached agentic engineering before this release, the GLM-5.1 model overview covers the architecture and design decisions that carried forward.

Benchmark improvements over GLM-5.1 are substantial across the board:

Benchmark	GLM-5.1	GLM-5.2	Δ
HLE	31.0	40.5	+9.5
GPQA-Diamond	86.2	91.2	+5.0
AIME 2026	95.3	99.2	+3.9
SWE-bench Pro	58.4	62.1	+3.7
DeepSWE	18.0	46.2	+28.2
FrontierSWE (Dominance)	30.5	74.4	+43.9
Terminal Bench 2.1	63.5	81.0	+17.5
MCP-Atlas (Public)	71.8	76.8	+5.0
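
The Δ column can be reproduced from the two score columns. A minimal sketch, with the figures copied from the table above and illustrative variable names:

```python
# Scores copied from the table above: (GLM-5.1, GLM-5.2).
scores = {
    "HLE": (31.0, 40.5),
    "GPQA-Diamond": (86.2, 91.2),
    "AIME 2026": (95.3, 99.2),
    "SWE-bench Pro": (58.4, 62.1),
    "DeepSWE": (18.0, 46.2),
    "FrontierSWE (Dominance)": (30.5, 74.4),
    "Terminal Bench 2.1": (63.5, 81.0),
    "MCP-Atlas (Public)": (71.8, 76.8),
}

# Absolute gain per benchmark, rounded to one decimal like the table.
deltas = {name: round(new - old, 1) for name, (old, new) in scores.items()}

# Benchmarks sorted by gain, largest first.
ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)

for name, delta in ranked:
    print(f"{name:<26}+{delta:.1f}")
```

Sorted this way, the three largest gains are FrontierSWE (Dominance), DeepSWE, and Terminal Bench 2.1, and the smallest is SWE-bench Pro.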

The coding gains are where the delta is hardest to ignore. DeepSWE jumps from 18 to 46.2, and FrontierSWE Dominance goes from 30.5 to 74.4 — both suggesting a meaningful shift in how the model handles real-world software engineering tasks, not just benchmark tuning. GLM-5.2 is competitive with DeepSeek-V4-Pro and Qwen3.7-Max across most categories, though it trails Claude Opus 4.8 on SWE-bench Pro (62.1 vs. 69.2).

On capabilities, GLM-5.2 supports function calling and structured JSON output, making it straightforward to drop into agentic pipelines. It handles English and Chinese natively and is available under an MIT license with no regional restrictions. If you want to compare against other available options, the full models catalog has context length, pricing, and capability details across providers.

Getting Started on DeepInfra

GLM-5.2 is available on DeepInfra under the identifier zai-org/GLM-5.2. Pricing is usage-based: Standard Tier runs $0.95 per 1M input tokens and $3.00 per 1M output tokens, with cached input at $0.18 per 1M tokens. If you need guaranteed throughput, Priority Tier is available at 1.5× those rates ($1.425 / $4.50 / $0.27). Private endpoint deployment is also supported for dedicated infrastructure. For a closer look at how GLM-5.1 pricing stacked up across providers before this release, the GLM-5.1 pricing guide gives useful context on where DeepInfra sits in the market.
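
To get a feel for what those rates mean per request, here is a rough estimator. It is a sketch, not a billing tool: the function name is hypothetical, and it assumes cached tokens are billed at the cached rate instead of the regular input rate.

```python
# Standard Tier rates from the paragraph above, in USD per 1M tokens.
STANDARD_RATES = {"input": 0.95, "cached_input": 0.18, "output": 3.00}
PRIORITY_MULTIPLIER = 1.5  # Priority Tier is 1.5x Standard


def estimate_cost(input_tokens, output_tokens, cached_tokens=0, priority=False):
    """Estimate the USD cost of one request.

    Assumption: cached tokens are a subset of input tokens and are billed
    at the cached rate instead of the regular input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * STANDARD_RATES["input"]
        + cached_tokens * STANDARD_RATES["cached_input"]
        + output_tokens * STANDARD_RATES["output"]
    ) / 1_000_000
    return cost * (PRIORITY_MULTIPLIER if priority else 1.0)


# A 200k-token prompt with a 4k-token answer on Standard Tier.
print(f"${estimate_cost(200_000, 4_000):.4f}")
```

Passing `priority=True` applies the 1.5× multiplier, and `cached_tokens` shows how much a repeated long prefix reduces the input side of the bill.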

Access is through a fully OpenAI-compatible API, with no infrastructure to manage and no containers to spin up. DeepInfra operates with a zero data-retention policy and is SOC 2 and ISO 27001 certified. If you want to understand the latency and throughput characteristics before committing, the GLM-5.1 API benchmarks offer a reasonable proxy until GLM-5.2-specific numbers are published.

Here’s everything you need to make your first call:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.2",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

import os

from openai import OpenAI

# Read the token from the environment rather than hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

// Read the token from the environment rather than hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.2",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without modification. GLM-5.2 also supports JSON output mode and function calling out of the box, so tool-use workflows slot in without any extra wiring. You can explore the full GLM-5.2 API reference for parameter details, supported endpoints, and response schemas.
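
For the tool-use side, here is a sketch of what a function-calling request body and the matching response handling might look like. It uses the OpenAI-style `tools` and `response_format` fields on the assumption that the OpenAI-compatible endpoint accepts them unchanged. The `get_weather` tool and both helper functions are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema; the name
# get_weather and its parameters are illustrative, not part of any real API.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt, tools=None, json_mode=False):
    """Build a chat-completions request body for zai-org/GLM-5.2."""
    body = {
        "model": "zai-org/GLM-5.2",
        "messages": [{"role": "user", "content": prompt}],
    }
    if tools:
        body["tools"] = tools
    if json_mode:
        # OpenAI-style JSON mode; assumed to be accepted unchanged here.
        body["response_format"] = {"type": "json_object"}
    return body


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string and need decoding.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


body = build_request("What's the weather in Paris?", tools=[WEATHER_TOOL])
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK, the body can be sent as `client.chat.completions.create(**body)`, and `extract_tool_calls` expects the response as a plain dict, for example from `response.model_dump()`.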

For voice-enabled use cases, GLM-5.2 voice is also available on DeepInfra — worth noting if your pipeline involves audio I/O alongside the text and tool-use workflows.

Conclusion

GLM-5.2 is worth evaluating on a few concrete grounds: a million-token context window that holds up under load, coding benchmark gains that look more like a capability jump than incremental tuning, and an MIT license that removes the friction you’d normally expect at this tier. For developers building document-heavy pipelines, long-running agents, or multi-step coding workflows, those properties are practically useful rather than just impressive on paper.

If you’ve been waiting for a high-context model you can deploy freely and wire into agentic tooling without fighting the license, this is a reasonable place to start. Head to the GLM-5.2 demo to run a few calls and see how it handles your workload.
