Step 3.7 Flash from StepFun is now live on DeepInfra. It's a 198B-parameter sparse Mixture-of-Experts vision-language model that activates only about 11B parameters per token, supports a 256K context window, and exposes three selectable reasoning levels—so you can dial the trade-off between speed, cost, and depth on a per-request basis. We've deployed it at the same competitive pricing you'll find elsewhere, and it's available through our standard OpenAI-compatible API with no special setup.
Step 3.7 Flash wasn't built to win isolated benchmarks. It was built for developers scaling agentic workflows that combine perception, search, and reasoning—parsing a massive financial report in a single pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in a high-throughput pipeline. That focus shows up everywhere in how the model behaves.
For autonomous agents, execution reliability matters more than raw model quality. An agent that drifts from instructions, violates a system constraint, or falls for an adversarial trap mid-trajectory is worse than useless. Step 3.7 Flash is tuned for exactly this kind of long-horizon, multi-turn orchestration.
Step 3.7 Flash pairs its 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding—not bolted-on captioning, but visual grounding that feeds directly into reasoning and retrieval.
In practice that means the model reads dense visual interfaces—UI wireframes, application GUIs, data charts—and maps them into structured code. When a visual asset is incomplete, it can recognize what's missing, run a lookup to fill the gap, and verify its conclusion before answering.
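As a minimal sketch of the screenshot-to-code flow, the helper below packages a local image and a prompt into a single user message in the OpenAI chat format. The function name `screenshot_to_message` is hypothetical, and the sketch assumes the endpoint accepts base64 `data:` URLs in the `image_url` field, which this post does not confirm.

```python
import base64


def screenshot_to_message(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build a user message pairing a text prompt with an inline image.

    Assumption: the endpoint accepts base64 data URLs in image_url.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real screenshot, e.g. open(path, "rb").read().
message = screenshot_to_message(b"\x89PNG placeholder", "Turn this wireframe into semantic HTML.")
```

The resulting dictionary can be passed as one element of the `messages` list in a chat completion request.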
Step 3.7 Flash exposes low, medium, and high reasoning levels through a single reasoning_effort parameter. Use low for latency-sensitive, high-volume calls; reach for high when a task needs deeper deliberation. It's the same model and the same endpoint—you just choose how much thinking to spend per request.
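One way to choose the level per request is to map each kind of task to a level ahead of time. The sketch below is illustrative only: the task categories and the `build_request` helper are hypothetical, and only the three level names, the `reasoning_effort` parameter, and the model name come from this post.

```python
# Hypothetical mapping from task category to reasoning level.
# The categories are illustrative; the three levels are the ones the model exposes.
EFFORT_BY_TASK = {
    "autocomplete": "low",       # latency-sensitive, high-volume
    "summarize": "medium",
    "multi_step_plan": "high",   # needs deeper deliberation
}


def build_request(task: str, prompt: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    effort = EFFORT_BY_TASK.get(task, "medium")  # fall back to the middle level
    return {
        "model": "stepfun-ai/Step-3.7-Flash",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_request("autocomplete", "Complete: def fib(n):")
# With a configured client: client.chat.completions.create(**request)
```

Keeping the mapping in one place makes it easy to tune levels per workload without touching call sites.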
| Token type | Price per 1M tokens |
|---|---|
| Input (cache miss) | $0.20 |
| Input (cache hit) | $0.04 |
| Output | $1.15 |
The cache-hit price is 80% lower than the cache-miss price ($0.04 versus $0.20 per 1M input tokens), which rewards the prefix-heavy prompts that agentic and multi-turn workloads naturally produce.
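To see what caching is worth for a given workload, a back-of-the-envelope estimate helps. The sketch below uses the prices from the table; the `estimate_cost` helper and the token counts in the example are hypothetical, and it assumes billing is simply linear in tokens.

```python
# Prices in USD per 1M tokens, taken from the table above.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    cached_tokens is the portion of input_tokens served from the cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * PRICE_INPUT_MISS
        + cached_tokens * PRICE_INPUT_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Hypothetical agent turn: 100k-token prompt, 90k of it a cached prefix, 2k output.
cold = estimate_cost(100_000, 0, 2_000)
warm = estimate_cost(100_000, 90_000, 2_000)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

In this made-up example the turn costs about $0.022 with no cache hits and about $0.008 when 90% of the prompt is a cached prefix.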
Step 3.7 Flash is available today through DeepInfra's OpenAI-compatible API. If you've used DeepInfra before, nothing changes—same API, same setup. Point your client at our endpoint and use stepfun-ai/Step-3.7-Flash as the model name:
import os

from openai import OpenAI

# Read the API token from the DEEPINFRA_TOKEN environment variable
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

# Text + image input, with a chosen reasoning level
completion = client.chat.completions.create(
    model="stepfun-ai/Step-3.7-Flash",
    reasoning_effort="medium",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this chart, and what does it imply?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

# Print the model's reply
print(completion.choices[0].message.content)
See the Step 3.7 Flash model page for live pricing and usage, browse the full model catalog, or read the docs to start building.
© 2026 DeepInfra. All rights reserved.