
Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after their initial gains — GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls.
What makes that architectural choice meaningful in practice is the model’s capacity for iterative strategy revision: breaking down ambiguous problems, running experiments, reading results, and identifying blockers rather than burning through a fixed repertoire early. It carries a 202,752-token context window, supports function calling and JSON natively, and ships under an MIT license — a meaningful detail for teams thinking about deployment flexibility. At $1.05 per million input tokens and $3.50 per million output tokens, it sits at a competitive price point relative to the frontier models it benchmarks against. It’s now available on DeepInfra.
GLM-5.1 is Z.AI’s successor to GLM-5, built around a specific thesis: most models hit a performance ceiling on long-running agentic tasks and then stall. GLM-5.1 is explicitly designed to keep improving as it’s given more time — sustaining performance across hundreds of rounds and thousands of tool calls rather than exhausting its strategy early.
The clearest evidence shows up in coding and terminal benchmarks, where GLM-5.1 pulls ahead of its predecessor by meaningful margins:
| Benchmark | GLM-5.1 | GLM-5 | Notable Comparisons |
|---|---|---|---|
| SWE-Bench Pro | 58.4 | 55.1 | Claude Opus 4.6: 57.3, GPT-5.4: 57.7 |
| NL2Repo | 42.7 | 35.9 | Claude Opus 4.6: 49.8, GPT-5.4: 41.3 |
| Terminal-Bench 2.0 | 63.5 | 56.2 | Claude Opus 4.6: 65.4 |
| CyberGym | 68.7 | 48.3 | Claude Opus 4.6: 66.6 |
On SWE-Bench Pro, GLM-5.1 lands ahead of both Claude Opus 4.6 and GPT-5.4. On NL2Repo it beats GPT-5.4 (41.3) but trails Claude Opus 4.6 (49.8), and on Terminal-Bench 2.0 it sits just behind Claude Opus 4.6 (63.5 vs. 65.4). CyberGym sees the most dramatic jump: from 48.3 to 68.7, beating Claude Opus 4.6’s 66.6. GLM-5.1 is also available on NVIDIA’s build platform, which gives you another access path if you’re already working within that ecosystem.
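To make the margins over GLM-5 concrete, the scores in the table above can be turned into point and relative gains. This is only a small arithmetic sketch: the numbers are copied from the table, and the `gains` helper is a hypothetical name introduced for illustration.

```python
# (GLM-5.1 score, GLM-5 score), copied from the benchmark table above.
SCORES = {
    "SWE-Bench Pro": (58.4, 55.1),
    "NL2Repo": (42.7, 35.9),
    "Terminal-Bench 2.0": (63.5, 56.2),
    "CyberGym": (68.7, 48.3),
}


def gains(scores):
    """Return {benchmark: (point_gain, relative_gain_percent)} for each row."""
    result = {}
    for name, (new, old) in scores.items():
        point_gain = round(new - old, 1)
        relative_gain = round(100 * (new - old) / old, 1)
        result[name] = (point_gain, relative_gain)
    return result


if __name__ == "__main__":
    for name, (points, percent) in gains(SCORES).items():
        print(f"{name}: +{points} points ({percent}% relative to GLM-5)")
```

Sorting the result by point gain puts CyberGym first, which matches the jump described above.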
On general reasoning, the gains are more modest. GPQA-Diamond moves from 86.0 to 86.2, math benchmarks are roughly flat or slightly down (HMMT Nov: 96.9 → 94.0), and HLE with tools goes from 50.4 to 52.3. The model is tuned for agentic work, not pure reasoning competitions. GLM-5.1 also scores 79.3 on BrowseComp with context management enabled, ahead of DeepSeek-V3.2 (51.4) and competitive with other top-tier models.
The model supports a 202,752-token context window with JSON and function calling — both required for real tool-use pipelines. It handles English and Chinese, is MIT-licensed, and is served in fp4 quantization on DeepInfra under zai-org/GLM-5.1. If you want to understand the broader GLM model lineage, the GLM-4.5 blog post covers the foundation model that preceded this generation.
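As a rough illustration of what function calling looks like in a tool-use pipeline, the sketch below builds a request body in the OpenAI chat-completions `tools` format and runs a tool call locally. It makes several assumptions: that the DeepInfra endpoint accepts the standard `tools` field for this model, that `count_lines` is a hypothetical tool invented for this example, and that the helper names (`build_request`, `tool_result_message`) are illustrative only. It makes no network call.

```python
import json

MODEL = "zai-org/GLM-5.1"


def count_lines(text: str) -> int:
    """Hypothetical local tool: count the lines in a block of text."""
    return len(text.splitlines())


# Tool schema in the OpenAI chat-completions "tools" format (assumed to be accepted as-is).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "count_lines",
            "description": "Count the number of lines in a block of text.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"count_lines": count_lines}


def build_request(user_message: str) -> dict:
    """Build the JSON body for a chat-completions request that offers the tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": TOOLS,
    }


def tool_result_message(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Run one requested tool call and wrap the result as a tool-role message."""
    arguments = json.loads(arguments_json)  # the model sends arguments as a JSON string
    result = REGISTRY[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps({"result": result}),
    }
```

In an agent loop, each tool-role message would be appended to `messages` and the request sent again, repeating until the model replies without requesting another tool.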
GLM-5.1 is available now on DeepInfra under the identifier zai-org/GLM-5.1 as a public endpoint. Pricing is usage-based: $1.05 per 1M input tokens, $3.50 per 1M output tokens, and $0.205 per 1M cached tokens. Private endpoint deployment is also supported if you need dedicated capacity — configure that directly from the DeepInfra dashboard.
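The per-million rates above translate into a per-request cost with simple arithmetic. The sketch below is a minimal illustration, not DeepInfra's billing logic: `completion_cost` is a hypothetical helper, and it assumes cached tokens are a subset of the input tokens and are billed at the cached rate instead of the input rate.

```python
# Rates in USD per 1M tokens, taken from the pricing above.
INPUT_PER_M = 1.05
OUTPUT_PER_M = 3.50
CACHED_PER_M = 0.205


def completion_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_tokens is the part of input_tokens served from cache,
    billed at the cached rate rather than the regular input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * INPUT_PER_M
        + cached_tokens * CACHED_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 20,000 input tokens (5,000 of them cached) and 2,000 output tokens.
    cost = completion_cost(20_000, 2_000, cached_tokens=5_000)
    print(f"${cost:.6f}")  # prints $0.023775
```

Multiplying the result by the expected number of requests per task gives a quick budget estimate for long agentic runs, where input tokens tend to dominate.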
DeepInfra gives you access to GLM-5.1 through an OpenAI-compatible API with zero infrastructure setup. DeepInfra operates with a zero-retention policy and is SOC 2 and ISO 27001 certified. If you’re planning to use GLM-5.1 for production coding workflows — Claude Code, Kilo Code, Cline, or similar tools — the GLM Coding Plan is worth reviewing for team-level access options.
To make your first call, grab your API key from the Dashboard and swap in the model identifier:
```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "zai-org/GLM-5.1",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

```python
import os

from openai import OpenAI

# Read the token from the environment; Python does not expand a literal "$DEEPINFRA_TOKEN" string.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```javascript
import OpenAI from "openai";

// Read the token from the environment rather than hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.1",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```

The only things that change from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without any modifications. Head to deepinfra.com/zai-org/GLM-5.1 to start building.
GLM-5.1 makes a credible case for itself in the scenarios where agentic models tend to break down — long-running tasks, messy repositories, and multi-step terminal workflows that demand sustained reasoning rather than a single flash of capability. The benchmark numbers against Claude Opus 4.6 and GPT-5.4 aren’t cherry-picked narrow wins; they reflect a model that was deliberately tuned for the kind of work developers actually need to automate.
That opens up real engineering applications: autonomous PR triage pipelines, self-directed debugging agents, or repo-scale refactoring tools that don’t fall apart midway through. If any of that maps to what you’re building, GLM-5.1 is worth running through your eval pipeline. It’s also worth keeping in mind that “agentic model” here means something specific: not just a model with tool access, but one designed for the iterative, multi-step problem solving that real engineering tasks demand. Head to deepinfra.com/zai-org/GLM-5.1 to get started.
© 2026 DeepInfra. All rights reserved.