DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepSeek V4 Pro is a 1.6-trillion parameter Mixture-of-Experts (MoE) model from DeepSeek, released on April 24, 2026 under the MIT license. It is designed for advanced reasoning, complex software engineering, and long-running agentic tasks, and arrives alongside DeepSeek-V4-Flash, a lighter 284B-parameter variant built for faster, lower-cost inference. The V4 series is DeepSeek’s first two-tier lineup and introduces a new architecture — the first from the lab since V3. Both models are hybrid thinking/non-thinking and support a 1 million token context window.
The V4 series is built on several technical advances over DeepSeek-V3.2:
The V4-Pro-Base model shows consistent improvements over V3.2 across standard academic benchmarks:
| Benchmark (Metric) | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8 | 88.7 | 90.1 |
| MMLU-Pro (EM) | 65.5 | 68.3 | 73.5 |
| GSM8K (8-shot) | 91.1 | 90.8 | 92.6 |
| HumanEval (Pass@1) | 62.8 | 69.5 | 76.8 |
In its maximum reasoning effort mode (V4-Pro-Max), the model competes directly with leading closed-source systems:
| Benchmark (Metric) | DS-V4-Pro Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | Opus-4.6 Max |
|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | — | 91.7 | 88.8 |
| GPQA Diamond (Pass@1) | 90.1 | 93.0 | 94.3 | 91.3 |
| SWE Verified (Resolved) | 80.6 | — | 80.6 | 80.8 |
A few additional results worth noting:
DeepSeek-V4-Pro is available for immediate integration via the DeepInfra platform under the model identifier deepseek-ai/DeepSeek-V4-Pro. Access the model at deepinfra.com/deepseek-ai/DeepSeek-V4-Pro.
Reasoning Modes
A key feature of DeepSeek V4 is configurable reasoning depth. Developers can select the level of thinking effort per request, trading latency for analytical depth:
| Reasoning Mode | Characteristics | Typical Use Cases |
|---|---|---|
| Non-think | Fast, intuitive, low-latency | Routine tasks, simple chat, low-risk decisions |
| Think High | Logical analysis, moderate latency | Complex problem-solving, planning, coding |
| Think Max | Maximum reasoning depth | Hard agentic tasks, boundary-pushing logic |
Response Format
The model’s output structure changes based on the selected mode, using <think> tags to encapsulate internal chain-of-thought reasoning:
JSON output is supported across all modes. The thinking and summary content are embedded within the standard JSON response body.
DeepSeek V4 Pro is available on DeepInfra with usage-based pricing calculated per million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $1.74 |
| Output Tokens | $3.48 |
| Cached Input Tokens | $0.145 |
A note on cost in practice: Think Max mode is token-intensive. On the Artificial Analysis Intelligence Index, V4 Pro (Max) used approximately 190M output tokens — far above the median of 47M for comparable open-weights models — bringing the total benchmark run cost to $1,071. That is still more than 4x cheaper than running the same benchmark on Claude Opus 4.7 ($4,811). For general output token pricing, the gap is larger: at $3.48/1M output tokens versus $25/1M for Claude Opus 4.7, V4 Pro is approximately 7x cheaper on output. For applications where Think Max mode generates long responses, monitoring output token usage is important.
Open-Source vs Closed-Source AI Models: Is the Gap Worth It?<p>The Artificial Analysis Intelligence Index sits at a ceiling of 57. Three frontier models — Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5 — all land in that band. Meanwhile, four open-weight models released between February and April 2026 now score 50 or above on the same index. A year ago, the best open-weight […]</p>
Best SaaS Platforms for Deploying Gemma 4 in 2026<p>Gemma 4 is available across a range of platforms — from fully managed API providers to local runners and no-code builders. The right choice depends on what you’re optimizing for: cost, latency, data privacy, local execution, or zero infrastructure overhead. This guide breaks down the top options by use case so you can match the […]</p>
Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost<p>About Qwen3 Coder 480B A35B Instruct Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters and 35 billion active parameters per inference, enabling high performance […]</p>
© 2026 DeepInfra. All rights reserved.