
Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after their initial gains — GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls.
What makes that architectural choice meaningful in practice is the model’s capacity for iterative strategy revision: breaking down ambiguous problems, running experiments, reading results, and identifying blockers rather than burning through a fixed repertoire early. It carries a 202,752-token context window, supports function calling and JSON natively, and ships under an MIT license — a meaningful detail for teams thinking about deployment flexibility. At $1.05 per million input tokens and $3.50 per million output tokens, it sits at a competitive price point relative to the frontier models it benchmarks against. It’s now available on DeepInfra.
GLM-5.1 is Z.AI’s successor to GLM-5, built around a specific thesis: most models hit a performance ceiling on long-running agentic tasks and then stall. GLM-5.1 is explicitly designed to keep improving as it’s given more time — sustaining performance across hundreds of rounds and thousands of tool calls rather than exhausting its strategy early.
The clearest evidence shows up in coding and terminal benchmarks, where GLM-5.1 pulls ahead of its predecessor by meaningful margins:
| Benchmark | GLM-5.1 | GLM-5 | Notable Comparisons |
|---|---|---|---|
| SWE-Bench Pro | 58.4 | 55.1 | Claude Opus 4.6: 57.3, GPT-5.4: 57.7 |
| NL2Repo | 42.7 | 35.9 | Claude Opus 4.6: 49.8, GPT-5.4: 41.3 |
| Terminal-Bench 2.0 | 63.5 | 56.2 | Claude Opus 4.6: 65.4 |
| CyberGym | 68.7 | 48.3 | Claude Opus 4.6: 66.6 |
On SWE-Bench Pro, GLM-5.1 lands ahead of both Claude Opus 4.6 and GPT-5.4. On NL2Repo it beats GPT-5.4 (42.7 vs. 41.3) but trails Claude Opus 4.6 (49.8), and on Terminal-Bench 2.0 it sits just behind Claude Opus 4.6 (63.5 vs. 65.4). CyberGym sees the most dramatic jump: from 48.3 to 68.7, beating Claude Opus 4.6’s 66.6. GLM-5.1 is also available on NVIDIA’s build platform, which gives you another access path if you’re already working within that ecosystem.
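To check the table's comparisons yourself, the short sketch below recomputes the gain over GLM-5 and the top scorer for each benchmark. The scores are copied from the table above. The dictionary layout and the helper names (`gain_over_predecessor`, `leader`) are illustrative choices, not part of any DeepInfra or Z.AI tooling.

```python
# Scores copied from the benchmark table; a model is omitted where no score was listed.
SCORES = {
    "SWE-Bench Pro": {"GLM-5.1": 58.4, "GLM-5": 55.1, "Claude Opus 4.6": 57.3, "GPT-5.4": 57.7},
    "NL2Repo": {"GLM-5.1": 42.7, "GLM-5": 35.9, "Claude Opus 4.6": 49.8, "GPT-5.4": 41.3},
    "Terminal-Bench 2.0": {"GLM-5.1": 63.5, "GLM-5": 56.2, "Claude Opus 4.6": 65.4},
    "CyberGym": {"GLM-5.1": 68.7, "GLM-5": 48.3, "Claude Opus 4.6": 66.6},
}


def gain_over_predecessor(benchmark: str) -> float:
    """Absolute point gain of GLM-5.1 over GLM-5, rounded to one decimal."""
    row = SCORES[benchmark]
    return round(row["GLM-5.1"] - row["GLM-5"], 1)


def leader(benchmark: str) -> str:
    """Name of the highest-scoring model listed for this benchmark."""
    row = SCORES[benchmark]
    return max(row, key=row.get)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: +{gain_over_predecessor(name)} vs GLM-5, leader: {leader(name)}")
```

Run as a script, this prints one line per benchmark, which makes it easy to see where GLM-5.1 leads outright and where it only narrows the gap.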
On general reasoning, the gains are more modest. GPQA-Diamond moves from 86.0 to 86.2, math benchmarks are roughly flat or slightly down (HMMT Nov: 96.9 → 94.0), and HLE with tools goes from 50.4 to 52.3. The model is tuned for agentic work, not pure reasoning competitions. GLM-5.1 also scores 79.3 on BrowseComp with context management enabled, ahead of DeepSeek-V3.2 (51.4) and competitive with other top-tier models.
The model supports a 202,752-token context window with JSON and function calling — both required for real tool-use pipelines. It handles English and Chinese, is MIT-licensed, and is served in fp4 quantization on DeepInfra under zai-org/GLM-5.1. If you want to understand the broader GLM model lineage, the GLM-4.5 blog post covers the foundation model that preceded this generation.
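As a rough illustration of what a function-calling request can look like, here is a minimal sketch that uses only the Python standard library. It assumes the endpoint accepts the OpenAI-style `tools` field and returns OpenAI-style `tool_calls`, which is how OpenAI-compatible APIs usually behave but is not confirmed here for this model. The `run_tests` tool and all helper names are hypothetical.

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "zai-org/GLM-5.1"

# Hypothetical tool definition in the OpenAI "tools" schema (an assumption, see above).
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory or file to test."},
            },
            "required": ["path"],
        },
    },
}


def build_request(prompt: str) -> dict:
    """Build a chat-completions payload that offers the model one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
    }


def send(payload: dict) -> dict:
    """POST the payload; requires DEEPINFRA_TOKEN in the environment."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response: dict) -> list:
    """Return (tool name, decoded arguments) pairs from an OpenAI-style response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

A typical loop would call `send(build_request(...))`, run whatever `extract_tool_calls` returns, append the results as `tool` messages, and repeat until the model answers without requesting a tool.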
GLM-5.1 is available now on DeepInfra under the identifier zai-org/GLM-5.1 as a public endpoint. Pricing is usage-based: $1.05 per 1M input tokens, $3.50 per 1M output tokens, and $0.205 per 1M cached tokens. Private endpoint deployment is also supported if you need dedicated capacity — configure that directly from the DeepInfra dashboard.
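For budgeting, the listed rates translate into a simple per-request estimate. The sketch below uses the three prices quoted above. It assumes cached tokens are billed at the cached rate instead of the input rate and that `input_tokens` counts only uncached input, so check your actual invoice before relying on it. The function name `estimate_cost` is illustrative.

```python
from decimal import Decimal

# USD per 1M tokens, taken from the pricing quoted in the text.
PRICE_PER_MILLION = {
    "input": Decimal("1.05"),
    "output": Decimal("3.50"),
    "cached": Decimal("0.205"),
}
ONE_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: input_tokens excludes cached tokens, which are billed separately.
    """
    return (
        Decimal(input_tokens) * PRICE_PER_MILLION["input"]
        + Decimal(output_tokens) * PRICE_PER_MILLION["output"]
        + Decimal(cached_tokens) * PRICE_PER_MILLION["cached"]
    ) / ONE_MILLION


if __name__ == "__main__":
    # A long agentic turn: 200k uncached input tokens and 10k output tokens.
    print(estimate_cost(200_000, 10_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up exactly over thousands of tool calls.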
DeepInfra gives you access to GLM-5.1 through an OpenAI-compatible API with zero infrastructure setup. DeepInfra operates with a zero-retention policy and is SOC 2 and ISO 27001 certified. If you’re planning to use GLM-5.1 for production coding workflows — Claude Code, Kilo Code, Cline, or similar tools — the GLM Coding Plan is worth reviewing for team-level access options.
To make your first call, grab your API key from the Dashboard and swap in the model identifier:
```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "zai-org/GLM-5.1",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

```python
import os

from openai import OpenAI

# Read the API key from the DEEPINFRA_TOKEN environment variable.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)
response = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```javascript
import OpenAI from "openai";

// Read the API key from the DEEPINFRA_TOKEN environment variable.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});
const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.1",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```

The only things that change from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without any modifications. Head to deepinfra.com/zai-org/GLM-5.1 to start building.
GLM-5.1 makes a credible case for itself in the scenarios where agentic models tend to break down — long-running tasks, messy repositories, and multi-step terminal workflows that demand sustained reasoning rather than a single flash of capability. The benchmark numbers against Claude Opus 4.6 and GPT-5.4 aren’t cherry-picked narrow wins; they reflect a model that was deliberately tuned for the kind of work developers actually need to automate.
That opens up real engineering applications: autonomous PR triage pipelines, self-directed debugging agents, or repo-scale refactoring tools that don’t fall apart midway through. If any of that maps to what you’re building, GLM-5.1 is worth running through your eval pipeline. It’s also worth keeping in mind that “agentic model” here means something specific: not just a model with tool access, but one designed around the iterative, multi-step problem solving that real engineering tasks actually demand. Head to deepinfra.com/zai-org/GLM-5.1 to get started.
© 2026 DeepInfra. All rights reserved.