
We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron™ 3 Nano Omni, the first multimodal model in the Nemotron 3 family.
Nemotron 3 Nano Omni is an open multimodal model that handles everything an agent needs to see and hear (images, video, audio, documents, and text) in a single inference pass. It delivers leading multimodal accuracy and ~9x higher throughput than other open omni models at the same interactivity, which translates into lower cost and better scalability.
On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead. You can build and scale always-on multimodal sub-agents — for computer use, document intelligence, and audio-video understanding — using only a few lines of code.
Most multimodal agents today are built by bolting a vision model next to a speech model next to an LLM. Every extra inference pass adds latency, every cross-model handoff fragments context, and orchestration and error handling multiply over long-running workflows. Nemotron 3 Nano Omni replaces that approach with a single unified model that sees, hears, reads, and reasons across modalities in one loop.
The model combines unified vision and audio encoders with a hybrid Mixture of Experts (MoE) and Mamba-Transformer backbone — the same architectural foundation as Nemotron 3 Nano, extended to natively understand and reason across images, video, audio, and text, with text output.
On top of this architecture, 3D convolution layers and Efficient Video Sampling (EVS) keep video reasoning cheap across long clips, and a hybrid MoE design activates ~3B of 30B parameters per token — giving Nemotron 3 Nano Omni inference economics closer to a small dense model while holding quality closer to a much larger one.
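As a back-of-the-envelope illustration of the sparse-activation figures above, the sketch below computes the share of parameters used per token. The simplification that per-token compute scales with active parameters is an assumption for illustration, not a measured result for this model.

```python
# Figures quoted in the text: roughly 3B of 30B parameters active per token.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters that participate in one token's forward pass."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)

# Assumption: per-token compute scales roughly with active parameters,
# so a dense model of the same total size would need about 1/fraction times more.
dense_compute_ratio = 1 / fraction

print(f"Active share per token: {fraction:.0%}")
print(f"Approximate compute ratio vs. a dense 30B model: {dense_compute_ratio:.0f}x")
```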
These design choices enable:
The 256K-token context window is a core part of the model's design. For multimodal agents, this means holding long screen sessions, multi-hour calls, and mixed-media documents in a single reasoning frame — without dropping critical context mid-task.
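To make the single-frame idea concrete, here is a minimal budgeting sketch that checks whether a set of inputs fits in the window. The 256,000-token limit, the output reserve, and the per-input token estimates are all illustrative assumptions; real counts depend on the model's tokenizer and on how each modality is encoded.

```python
# Assumption: "256K" is treated as 256,000 tokens; check the model page for the exact limit.
CONTEXT_WINDOW_TOKENS = 256_000


def remaining_context(input_token_counts, reserved_for_output=4_096,
                      limit=CONTEXT_WINDOW_TOKENS):
    """Return tokens left after inputs and an output reserve (negative if over budget)."""
    return limit - reserved_for_output - sum(input_token_counts.values())


# Hypothetical per-input token estimates supplied by the caller.
session = {
    "screen_recording": 120_000,
    "call_transcript": 60_000,
    "contract_pdf": 40_000,
}

left = remaining_context(session)
if left < 0:
    print(f"Over budget by {-left} tokens; trim or summarize an input.")
else:
    print(f"{left} tokens of headroom remain.")
```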
Nemotron 3 Nano Omni delivers leading scores across a wide range of multimodal benchmarks, including MathVista, Video-MME, OCRv2, CharXiv, ScreenSpot-Pro, MMLongBench-Doc, WorldSense, Daily Omni, MMAU, and VoiceBench. For more details, check out the NVIDIA technical blog.
Like the rest of the Nemotron 3 family, Nemotron 3 Nano Omni is fully open with access to model weights, training datasets, and development recipes. This transparency enables teams to inspect, customize, and fine-tune the model for their domain-specific use cases such as computer-use agents, document intelligence, or multimodal reasoning.
Nemotron 3 Nano Omni is accessible via DeepInfra's OpenAI-compatible API. You can get started in a few lines of code.
Install the client:
pip install openai
Run your first inference (image input):
from openai import OpenAI

client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "system", "content": "You are a helpful perception agent."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI state in this screenshot and suggest the next action for an automation agent."},
                {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
            ],
        },
    ],
)

print(response.choices[0].message.content)
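If the screenshot lives on disk rather than at a public URL, one common pattern with OpenAI-compatible APIs is to pass the image inline as a base64 data URL. The sketch below assumes this endpoint accepts data URLs in the `image_url` field, which should be confirmed against the model documentation. The helper names `image_to_data_url` and `build_image_message` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt: str, image_path: str) -> dict:
    """Build a user message with the same shape as the request above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the endpoint accepts data URLs here, as many
            # OpenAI-compatible servers do.
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }
```

The returned dictionary can be placed in the `messages` list in place of the user message shown earlier.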
Stream responses (text output):
# Reuses the `client` created in the previous example.
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to extract structured data from this invoice PDF."}
    ],
    stream=True,
)

# Print each text delta as soon as it arrives.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Streaming applies to response generation (text output). Audio, video, image, and document inputs are supported through the same chat completions endpoint — see the model documentation for the full request schema.
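When the full text is needed after streaming finishes (for logging or for a downstream step), the deltas can be accumulated instead of printed. The sketch below is a minimal, defensive version: it assumes chunks follow the OpenAI streaming shape (`choices[0].delta.content`) and skips chunks with no choices or no content, which some OpenAI-compatible servers emit. The helper `collect_stream_text` is hypothetical, and the fake chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream_text(stream) -> str:
    """Join the text deltas of a chat completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Some chunks (for example a trailing usage chunk) carry no choices.
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object shaped like a streaming chunk (for demonstration only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


fake_stream = [
    _fake_chunk("Hello"),
    SimpleNamespace(choices=[]),  # chunk without choices
    _fake_chunk(None),            # chunk without text content
    _fake_chunk(", world"),
]
print(collect_stream_text(fake_stream))  # prints: Hello, world
```

In real use, pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.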
DeepInfra operates with a zero-retention policy: inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified and follows industry best practices for security and privacy. More information is available in the DeepInfra Trust Center.
Visit the Nemotron 3 Nano Omni model page on DeepInfra to explore pricing and start inference instantly. Check out our documentation to learn more about the broader model ecosystem and developer resources.
Have questions or need help? Reach out at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.
© 2026 DeepInfra. All rights reserved.