Introducing GLM-5.2 on DeepInfra
Published on 2026.07.01 by DeepInfra

GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks — degraded retrieval, inconsistent behavior at range. Z-AI describes this as the first time that scale has been delivered with reliability for sustained, long-horizon work. The coding improvements over its predecessor make the same point in numbers: DeepSWE goes from 18 to 46.2 between GLM-5.1 and GLM-5.2.

The architecture behind that context window is new. GLM-5.2 introduces IndexShare, which reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — meaning the longer window doesn’t come with the proportional compute cost you’d expect. The model also ships under an MIT license with no regional restrictions, which puts it in a different category from most models competing at this benchmark tier. It’s now available on DeepInfra under zai-org/GLM-5.2.

What Makes This Model Different

GLM-5.2 is Z-AI’s follow-up to GLM-5.1, and the headline upgrade is a stable 1,048,576-token (1M) context window. Long-context support isn’t new, but reliable performance at that scale on long-horizon tasks is harder to deliver than it sounds. The key enabler is IndexShare, an architectural change that shares a single indexer across each group of four sparse attention layers. That cuts per-token FLOPs by 2.9× at 1M context length, which is what makes very long contexts practical to run rather than theoretical.
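To make the sharing pattern concrete, here is a toy sketch in Python. It is not Z-AI's implementation: the scoring rule, the top-k selection, and names such as `toy_indexer` and `GROUP_SIZE` are hypothetical. The only detail taken from the description above is that one indexer result is reused across each group of four sparse attention layers.

```python
# Toy illustration of reusing one indexer result across groups of four
# sparse-attention layers. All names and the scoring rule are hypothetical;
# this is not the GLM-5.2 implementation.

GROUP_SIZE = 4  # from the description: one indexer per four sparse layers


def toy_indexer(scores, top_k):
    """Stand-in indexer: return the positions of the top_k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def run_layers(num_layers, scores, top_k, share=True):
    """Walk the layer stack and count how often the indexer runs.

    With share=True the indexer runs only on the first layer of each group,
    and the remaining layers in the group reuse its selection.
    """
    indexer_calls = 0
    selected = None
    per_layer_selection = []
    for layer in range(num_layers):
        if not share or layer % GROUP_SIZE == 0:
            selected = toy_indexer(scores, top_k)
            indexer_calls += 1
        per_layer_selection.append(selected)
    return indexer_calls, per_layer_selection


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
shared_calls, shared_sel = run_layers(8, scores, top_k=3, share=True)
unshared_calls, _ = run_layers(8, scores, top_k=3, share=False)
print(shared_calls, unshared_calls)
print(shared_sel[0])
```

In this toy, an eight-layer stack runs the indexer twice instead of eight times. The 2.9× FLOPs figure is the reported number for the real model and is not derived from this sketch.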

On the inference side, GLM-5.2 ships with an improved multi-token prediction (MTP) layer that increases speculative decoding acceptance length by up to 20% over GLM-5.1. Combined with flexible thinking effort levels for coding tasks — letting you trade latency for quality depending on the task — the model gives developers real levers to tune behavior. For a detailed breakdown of how GLM-5.1 approached agentic engineering before this release, the GLM-5.1 model overview covers the architecture and design decisions that carried forward.

Benchmark improvements over GLM-5.1 are substantial across the board:

Benchmark                  GLM-5.1   GLM-5.2   Δ
HLE                        31.0      40.5      +9.5
GPQA-Diamond               86.2      91.2      +5.0
AIME 2026                  95.3      99.2      +3.9
SWE-bench Pro              58.4      62.1      +3.7
DeepSWE                    18.0      46.2      +28.2
FrontierSWE (Dominance)    30.5      74.4      +43.9
Terminal Bench 2.1         63.5      81.0      +17.5
MCP-Atlas (Public)         71.8      76.8      +5.0

The coding gains are where the delta is hardest to ignore. DeepSWE jumps from 18 to 46.2, and FrontierSWE Dominance goes from 30.5 to 74.4 — both suggesting a meaningful shift in how the model handles real-world software engineering tasks, not just benchmark tuning. GLM-5.2 is competitive with DeepSeek-V4-Pro and Qwen3.7-Max across most categories, though it trails Claude Opus 4.8 on SWE-bench Pro (62.1 vs. 69.2).
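Absolute deltas understate how lopsided the gains are, so it can help to look at relative change as well. The short script below recomputes both from the scores in the table; the `SCORES` dictionary and the `deltas` helper are illustrative names, and the numbers are copied from the table rather than measured independently.

```python
# Scores copied from the benchmark table: (GLM-5.1, GLM-5.2).
SCORES = {
    "HLE": (31.0, 40.5),
    "GPQA-Diamond": (86.2, 91.2),
    "AIME 2026": (95.3, 99.2),
    "SWE-bench Pro": (58.4, 62.1),
    "DeepSWE": (18.0, 46.2),
    "FrontierSWE (Dominance)": (30.5, 74.4),
    "Terminal Bench 2.1": (63.5, 81.0),
    "MCP-Atlas (Public)": (71.8, 76.8),
}


def deltas(scores):
    """Return {benchmark: (absolute gain, new/old ratio)}, rounded for display."""
    result = {}
    for name, (old, new) in scores.items():
        result[name] = (round(new - old, 1), round(new / old, 2))
    return result


# Print benchmarks ordered by relative improvement, largest first.
ranked = sorted(deltas(SCORES).items(), key=lambda item: item[1][1], reverse=True)
for name, (gain, ratio) in ranked:
    print(f"{name:<26} {gain:+6.1f}   x{ratio:.2f}")
```

Sorted this way, DeepSWE and FrontierSWE (Dominance) come out on top, each at roughly two and a half times the GLM-5.1 score, while the reasoning benchmarks that were already near saturation move by a few percent.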

On capabilities, GLM-5.2 supports function calling and structured JSON output, making it straightforward to drop into agentic pipelines. It handles English and Chinese natively and is available under an MIT license with no regional restrictions. If you want to compare against other available options, the full models catalog has context length, pricing, and capability details across providers.
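As a rough sketch of what structured JSON output looks like in practice, the snippet below builds a request body and validates a reply. It assumes the endpoint accepts the OpenAI-style `response_format` field; the helper names, the system prompt, and the `answer`/`confidence` keys are made up for illustration, and the sample reply is hand-written rather than real model output.

```python
import json


def build_json_request(question):
    """Request body asking for a JSON object reply.

    The response_format field follows the OpenAI chat-completions
    convention; that GLM-5.2 accepts it in this exact form is an assumption.
    """
    return {
        "model": "zai-org/GLM-5.2",
        "messages": [
            {
                "role": "system",
                "content": "Reply with a JSON object with keys 'answer' and 'confidence'.",
            },
            {"role": "user", "content": question},
        ],
        "response_format": {"type": "json_object"},
    }


def parse_reply(content, required_keys=("answer", "confidence")):
    """Parse the model's text reply and check the keys we asked for."""
    data = json.loads(content)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


payload = build_json_request("What is the context window of GLM-5.2?")

# Hand-written stand-in for response.choices[0].message.content.
sample_content = '{"answer": "1,048,576 tokens", "confidence": "high"}'
print(parse_reply(sample_content)["answer"])
```

Validating the parsed object before passing it downstream keeps an agentic pipeline from acting on a reply that is valid JSON but not the shape it expected.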

Getting Started on DeepInfra

GLM-5.2 is available on DeepInfra under the identifier zai-org/GLM-5.2. Pricing is usage-based: Standard Tier runs $0.95 per 1M input tokens and $3.00 per 1M output tokens, with cached input at $0.18 per 1M tokens. If you need guaranteed throughput, Priority Tier is available at 1.5× those rates ($1.425 / $4.50 / $0.27). Private endpoint deployment is also supported for dedicated infrastructure. For a closer look at how GLM-5.1 pricing stacked up across providers before this release, the GLM-5.1 pricing guide gives useful context on where DeepInfra sits in the market.
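For budgeting, the rates above translate into a small cost estimator. This is a sketch under stated assumptions: `estimate_cost` is a hypothetical helper, and it treats cached tokens as a subset of the input tokens, which may not match how billing actually counts them.

```python
# Per-1M-token prices in USD, copied from the paragraph above.
STANDARD = {"input": 0.95, "cached_input": 0.18, "output": 3.00}
PRIORITY_MULTIPLIER = 1.5  # Priority Tier is 1.5x the Standard rates


def estimate_cost(input_tokens, output_tokens, cached_tokens=0, priority=False):
    """Estimate the cost of one request in USD.

    cached_tokens is the part of input_tokens billed at the cached rate;
    treating it as a subset of input_tokens is an assumption.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * STANDARD["input"]
        + cached_tokens * STANDARD["cached_input"]
        + output_tokens * STANDARD["output"]
    ) / 1_000_000
    return cost * PRIORITY_MULTIPLIER if priority else cost


# A prompt that fills the 1,048,576-token window, with a 4,000-token reply.
print(round(estimate_cost(1_048_576, 4_000), 4))
print(round(estimate_cost(1_048_576, 4_000, priority=True), 4))
```

At these rates a single full-window prompt costs roughly a dollar on Standard Tier, so cached input matters quickly for workloads that resend the same long context.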

Access is through a fully OpenAI-compatible API, with no infrastructure to manage and no containers to spin up. DeepInfra operates with a zero data-retention policy and is SOC 2 and ISO 27001 certified. If you want to understand the latency and throughput characteristics before committing, the GLM-5.1 API benchmarks offer a reasonable proxy until GLM-5.2-specific numbers are published.

Here’s everything you need to make your first call:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.2",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
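
# Python (official OpenAI SDK)
import os

from openai import OpenAI


client = OpenAI(
    # Read the token from the environment; Python does not expand "$DEEPINFRA_TOKEN" inside a string.
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)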


response = client.chat.completions.create(
    model="zai-org/GLM-5.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
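
// JavaScript (official OpenAI Node.js SDK)
import OpenAI from "openai";


const openai = new OpenAI({
  // Read the token from the environment; a "$DEEPINFRA_TOKEN" string literal is not expanded.
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});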


const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.2",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without modification. GLM-5.2 also supports JSON output mode and function calling out of the box, so tool-use workflows slot in without any extra wiring. You can explore the full GLM-5.2 API reference for parameter details, supported endpoints, and response schemas.
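A tool-use round trip might look like the sketch below. The request and tool-call shapes follow the OpenAI chat-completions convention for `tools`; the `word_count` tool, the helper names, and the sample tool call are hypothetical, and the sample is hand-written rather than real model output. With the SDK, tool calls arrive as objects, so this sketch assumes they have been converted to plain dictionaries first.

```python
import json


def word_count(text):
    """Hypothetical local tool: count whitespace-separated words."""
    return len(text.split())


# Registry of callable tools and the schema advertised to the model.
TOOLS = {"word_count": word_count}
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "word_count",
            "description": "Count the words in a piece of text.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]


def build_tool_request(user_message):
    """Request body that offers the model the tools above."""
    return {
        "model": "zai-org/GLM-5.2",
        "messages": [{"role": "user", "content": user_message}],
        "tools": TOOL_SCHEMAS,
    }


def run_tool_call(tool_call):
    """Execute one tool call and build the follow-up 'tool' message."""
    name = tool_call["function"]["name"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    # Arguments arrive as a JSON-encoded string, not a dictionary.
    args = json.loads(tool_call["function"]["arguments"])
    result = TOOLS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }


# Hand-written stand-in for one entry of message.tool_calls.
sample_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "word_count", "arguments": '{"text": "long context window"}'},
}
print(run_tool_call(sample_call))
```

The returned message is appended to the conversation, along with the assistant message that requested the call, and sent back in the next request so the model can use the tool result.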

For voice-enabled use cases, GLM-5.2 voice is also available on DeepInfra — worth noting if your pipeline involves audio I/O alongside the text and tool-use workflows.

Conclusion

GLM-5.2 is worth evaluating on a few concrete grounds: a million-token context window built to stay stable on long-horizon tasks, coding benchmark gains that look more like a capability jump than incremental tuning (DeepSWE up from 18.0 to 46.2), and an MIT license with no regional restrictions. For developers building document-heavy pipelines, long-running agents, or multi-step coding workflows, those properties are practically useful rather than just impressive on paper.

If you’ve been waiting for a high-context model you can deploy freely and wire into agentic tooling without fighting the license, this is a reasonable place to start. Head to the GLM-5.2 demo to run a few calls and see how it handles your workload.
