GLM-5.2 is Z.ai’s flagship open-source large language model, engineered for long-horizon coding, agentic workflows, and complex reasoning. It also targets advanced software engineering and large-scale data processing, and it introduces a 1,048,576-token context window alongside significant architectural changes.
Hosted on the DeepInfra platform, GLM-5.2 provides developers with a high-performance, OpenAI-compatible interface. Whether you are building agentic workflows, analyzing entire codebases, or processing lengthy documents, GLM-5.2 offers the stability and intelligence required for next-generation AI applications.
GLM-5.2 was released on June 13, 2026, succeeding GLM-5.1 in the GLM-5 family. Unlike previous iterations, this model is engineered to maintain output quality and stability even when the 1M-token context is fully utilized, allowing for the seamless processing of large datasets and complex, multi-file repositories in a single prompt.
IndexShare and MTP: To support this context window efficiently, Z.ai introduced IndexShare, a mechanism that reuses the same indexer across every four sparse attention layers, resulting in a reported 2.9x reduction in per-token FLOPs at maximum context length. An upgraded Multi-Token Prediction (MTP) layer also improves speculative decoding, increasing token acceptance length by up to 20% for faster, more cost-effective generation.
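As a rough illustration of the sharing pattern described above, the sketch below maps each sparse attention layer to the layer whose indexer result it would reuse. The function names, the grouping of consecutive layers into fours, and the 8-layer example are all assumptions made for illustration; the description here does not say how GLM-5.2 actually groups layers or what the indexer computes.

```python
# Toy illustration only. Assumes layers are grouped consecutively in fours
# and that the first layer of each group runs the indexer for the whole group.
GROUP_SIZE = 4  # "every four sparse attention layers", per the description


def indexer_owner(layer: int, group_size: int = GROUP_SIZE) -> int:
    """Return the index of the layer whose indexer result this layer reuses."""
    return (layer // group_size) * group_size


def indexer_runs(num_layers: int, group_size: int = GROUP_SIZE) -> int:
    """Count how many times the indexer runs per token across num_layers layers."""
    return len({indexer_owner(i, group_size) for i in range(num_layers)})


if __name__ == "__main__":
    for layer in range(8):
        print(f"layer {layer} reuses the indexer from layer {indexer_owner(layer)}")
    print("indexer runs for 8 layers:", indexer_runs(8))
```

Under this assumption the indexer runs once per group instead of once per layer. The reported 2.9x figure covers per-token FLOPs for the whole model and cannot be derived from this count alone.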
Flexible Reasoning: GLM-5.2 features a “Flexible Effort” system (High and Max modes) that lets users adjust the model’s thinking depth to balance reasoning performance against latency. Z.ai recommends the Max effort level for complex, multi-step tasks.
Open Access: GLM-5.2 is released under the MIT license, allowing unrestricted commercial use, modification, and self-hosting.
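GLM-5.2 performs strongly across industry-standard evaluations. In the table below it posts the highest scores on AIME 2026 and IMOAnswerBench and comes within a few points of GPT-5.5 and Claude Opus 4.8 on most other benchmarks. The widest gap is on SWE-bench Pro, where Claude Opus 4.8 leads 69.2 to 62.1.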
| Category | Benchmark | GLM-5.2 | GLM-5.1 | Qwen3.7-Max | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|
| Reasoning | GPQA-Diamond | 91.2 | 86.2 | 90.0 | 93.6 | 93.6 |
| Math | AIME 2026 | 99.2 | 95.3 | 97.0 | 98.3 | 95.7 |
| Math | IMOAnswerBench | 91.0 | 83.8 | 90.0 | — | 83.5 |
| Coding | SWE-bench Pro | 62.1 | 58.4 | 60.6 | 58.6 | 69.2 |
| Coding | FrontierSWE | 74.4 | 30.5 | — | 72.6 | 75.1 |
| Agentic | MCP-Atlas | 76.8 | 71.8 | 76.4 | 75.3 | 77.8 |
GLM-5.2 is accessible via DeepInfra’s OpenAI-compatible API, making integration straightforward for developers familiar with standard LLM tooling.
Use your DeepInfra API key in the Authorization header:
Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>
curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept of speculative decoding in 2 sentences."
      }
    ],
    "temperature": 0.7
  }'

| Parameter | Type | Description |
|---|---|---|
| model | String | Use "zai-org/GLM-5.2". |
| messages | Array | Conversation history objects. |
| response_format | Object | Set to {"type": "json_object"} for structured JSON. |
| tools | Array | Definitions for function calling. |
| temperature | Float | Controls randomness (0.0 to 2.0). |
| max_tokens | Integer | Maximum tokens to generate in the response. |
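The same request can be assembled in Python. The sketch below uses only the standard library and the endpoint, model name, and parameters listed above. The helper names (`build_request`, `send`), the default values, and the response shape (`choices[0].message.content`, taken from the OpenAI chat completions format) are assumptions rather than documented DeepInfra behavior.

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "zai-org/GLM-5.2"


def build_request(api_key, prompt, *, temperature=0.7, max_tokens=256,
                  json_mode=False, tools=None):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if json_mode:
        # Ask for structured JSON output, as in the parameter table.
        payload["response_format"] = {"type": "json_object"}
    if tools:
        # Function-calling definitions, passed through unchanged.
        payload["tools"] = tools
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Only performs a network call when a key is present in the environment.
if __name__ == "__main__" and os.environ.get("DEEPINFRA_API_KEY"):
    req = build_request(
        os.environ["DEEPINFRA_API_KEY"],
        "Explain the concept of speculative decoding in 2 sentences.",
    )
    reply = send(req)
    # Assumed OpenAI-style response shape.
    print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any tokens are billed.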
DeepInfra offers a flexible, pay-per-token pricing model for GLM-5.2, with options for both standard and prioritized workloads.
| Tier | Input | Cached Input | Output |
|---|---|---|---|
| Standard | $0.95 / 1M tokens | $0.18 / 1M tokens | $3.00 / 1M tokens |
| Priority (1.5×) | $1.425 / 1M tokens | $0.27 / 1M tokens | $4.50 / 1M tokens |
The Priority Tier, available at 1.5× the standard rate, is designed for workloads requiring higher priority and faster processing.
While standard API access is usage-based, users with high-throughput requirements can deploy Private Endpoints via the DeepInfra Dashboard for dedicated capacity.
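To see how the per-token rates add up for a given workload, the sketch below estimates a request's cost from the table above. The function name and the example token counts are hypothetical. It also assumes cached input tokens are billed at the cached rate instead of, not in addition to, the regular input rate, so check actual billing behavior before relying on it.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the Standard row of the pricing table.
STANDARD_RATES = {
    "input": Decimal("0.95"),
    "cached_input": Decimal("0.18"),
    "output": Decimal("3.00"),
}
PRIORITY_MULTIPLIER = Decimal("1.5")
PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, cached_input_tokens, output_tokens, priority=False):
    """Estimate USD cost; input_tokens is assumed to exclude cached tokens."""
    counts = {
        "input": input_tokens,
        "cached_input": cached_input_tokens,
        "output": output_tokens,
    }
    cost = sum(
        STANDARD_RATES[kind] * Decimal(n) / PER_MILLION
        for kind, n in counts.items()
    )
    return cost * PRIORITY_MULTIPLIER if priority else cost


if __name__ == "__main__":
    # Hypothetical long-context call: mostly cached prompt, short answer.
    print("standard:", estimate_cost(200_000, 800_000, 50_000))
    print("priority:", estimate_cost(200_000, 800_000, 50_000, priority=True))
```

`Decimal` is used so that small per-request amounts do not pick up floating-point rounding noise when summed over many calls.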
GLM-5.2 combines a massive 1M-token context window with strong reasoning and coding capabilities, supported by architectural innovations like IndexShare and a flexible reasoning system. It provides developers with the efficiency and power needed for complex agentic and long-horizon tasks.
To begin building with GLM-5.2, visit the DeepInfra Dashboard to generate your API key and explore private deployment options.