
We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron™ 3 Nano Omni, the first multimodal model in the Nemotron 3 family.
Nemotron 3 Nano Omni is an open multimodal model that handles everything an agent needs to see and hear (images, video, audio, documents, and text) in a single inference pass. It delivers leading multimodal accuracy and ~9x higher throughput than other open omni models at the same interactivity, which translates into lower cost and better scalability.
On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead. You can build and scale always-on multimodal sub-agents — for computer use, document intelligence, and audio-video understanding — using only a few lines of code.
Most multimodal agents today are built by bolting a vision model next to a speech model next to an LLM. Every extra inference pass adds latency, every cross-model handoff fragments context, and orchestration and error handling multiply over long-running workflows. Nemotron 3 Nano Omni replaces that approach with a single unified model that sees, hears, reads, and reasons across modalities in one loop.
The model combines unified vision and audio encoders with a hybrid Mixture of Experts (MoE) and Mamba-Transformer backbone — the same architectural foundation as Nemotron 3 Nano, extended to natively understand and reason across images, video, audio, and text, with text output.
On top of this architecture, 3D convolution layers and Efficient Video Sampling (EVS) keep video reasoning cheap across long clips, and a hybrid MoE design activates ~3B of 30B parameters per token — giving Nemotron 3 Nano Omni inference economics closer to a small dense model while holding quality closer to a much larger one.
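To make the sparse-activation idea concrete, here is a minimal sketch of generic top-k expert routing together with the active-parameter arithmetic. It is not NVIDIA's routing implementation: the function names, the example gate logits, and `k=2` are illustrative assumptions. Only the ~3B active of 30B total figures come from the text above.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Generic MoE gating sketch (assumption): real routers differ in detail.
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only; subtract the max for stability.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(active_params, total_params):
    """Share of the model's parameters that run for a single token."""
    return active_params / total_params


# Four hypothetical experts; only two of them run for this token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print([index for index, _ in routes])
print(f"{active_fraction(3e9, 30e9):.0%} of parameters active per token")
```

Because only the selected experts execute, per-token compute tracks the active parameter count rather than the total, which is the source of the "small dense model" economics described above.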
These design choices enable:
The 256K-token context window is a core part of the model's design. For multimodal agents, this means holding long screen sessions, multi-hour calls, and mixed-media documents in a single reasoning frame — without dropping critical context mid-task.
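A rough way to reason about that window when planning an agent task is to budget tokens before sending a request. The sketch below is hypothetical: the helper name, the per-input token estimates, and the reading of "256K" as 256,000 tokens are all assumptions, and real token counts depend on the model's tokenizer and encoders.

```python
# Assumption: conservative reading of "256K"; the exact limit may differ.
CONTEXT_WINDOW_TOKENS = 256_000


def fits_in_context(estimated_tokens, reserve_for_output=0,
                    window=CONTEXT_WINDOW_TOKENS):
    """Check whether estimated inputs plus an output reserve fit the window.

    estimated_tokens: dict mapping an input label to an estimated token count
    (hypothetical numbers supplied by the caller).
    Returns (fits, remaining), where remaining is negative on overflow.
    """
    used = sum(estimated_tokens.values()) + reserve_for_output
    remaining = window - used
    return remaining >= 0, remaining


# Hypothetical estimates for a long meeting plus its slide deck.
ok, remaining = fits_in_context(
    {"call_transcript": 120_000, "slide_deck": 60_000},
    reserve_for_output=8_000,
)
print(ok, remaining)
```

If the check fails, the agent would need to trim or summarize inputs before the call rather than risk truncation mid-task.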
Nemotron 3 Nano Omni delivers leading scores across a wide range of multimodal benchmarks, including MathVista, Video-MME, OCRv2, CharXiv, ScreenSpot-Pro, MMLongBench-Doc, WorldSense, Daily Omni, MMAU, and VoiceBench. For more details, check out the NVIDIA technical blog.
Like the rest of the Nemotron 3 family, Nemotron 3 Nano Omni is fully open with access to model weights, training datasets, and development recipes. This transparency enables teams to inspect, customize, and fine-tune the model for their domain-specific use cases such as computer-use agents, document intelligence, or multimodal reasoning.
Nemotron 3 Nano Omni is accessible via DeepInfra's OpenAI-compatible API. You can get started in a few lines of code.
Install the client:
pip install openai
Run your first inference (image input):
from openai import OpenAI

# Point the OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "system", "content": "You are a helpful perception agent."},
        {
            "role": "user",
            # A single user turn can mix text and image parts.
            "content": [
                {"type": "text", "text": "Describe the UI state in this screenshot and suggest the next action for an automation agent."},
                {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
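The example above passes a public image URL. For a local screenshot, a common pattern with OpenAI-compatible APIs is to embed the bytes as a base64 data URL. The helper below is a sketch under that assumption: `image_part_from_bytes` is a hypothetical name, and whether this endpoint accepts data URLs for this model should be confirmed in the model documentation.

```python
import base64
import mimetypes


def image_part_from_bytes(data: bytes, filename: str) -> dict:
    """Build an image_url content part from raw image bytes.

    Assumption: the endpoint accepts data URLs in the image_url field.
    The MIME type is guessed from the filename extension.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


# Placeholder bytes stand in for a real screenshot read from disk.
part = image_part_from_bytes(b"\x89PNG", "screen.png")
print(part["image_url"]["url"][:22])
```

The returned dictionary could take the place of the `image_url` entry in the `content` list of the request above.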
Stream responses (text output):
# Reuses the client created in the previous example.
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to extract structured data from this invoice PDF."}
    ],
    stream=True,
)
for chunk in stream:
    # Print each text delta as it arrives; some chunks carry no text.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Streaming applies to response generation (text output). Audio, video, image, and document inputs are supported through the same chat completions endpoint — see the model documentation for the full request schema.
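When an agent needs the full reply as well as the live stream, the deltas can be accumulated into one string. The sketch below is an assumption-laden helper, not part of any SDK: `collect_stream_text` and `fake_chunk` are hypothetical names, and the guard for chunks with an empty `choices` list covers behavior seen on some OpenAI-compatible servers that may or may not apply here.

```python
from types import SimpleNamespace


def collect_stream_text(stream) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        # Defensive: skip chunks that carry no choices (assumed possible).
        if not chunk.choices:
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only to exercise the helper."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None),
        SimpleNamespace(choices=[]), fake_chunk("lo")]
print(collect_stream_text(demo))
```

Passing the `stream` object from the streaming request above in place of `demo` would return the complete response text once generation finishes.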
DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our DeepInfra Trust Center.
Visit the Nemotron 3 Nano Omni model page on DeepInfra to explore pricing and start inference instantly. Check out our documentation to learn more about the broader model ecosystem and developer resources.
Have questions or need help? Reach out at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.