Introducing NVIDIA Nemotron 3 Nano Omni on DeepInfra

Published on 2026.04.28 by Aray Sultanbekova

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron™ 3 Nano Omni, the first multimodal model in the Nemotron 3 family.

Nemotron 3 Nano Omni is an open multimodal model that handles everything an agent needs to see and hear (images, video, audio, documents, and text) in a single inference pass. It delivers leading multimodal accuracy and ~9x higher throughput than other open omni models at the same interactivity, which translates to lower cost and better scalability.

On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead. You can build and scale always-on multimodal sub-agents — for computer use, document intelligence, and audio-video understanding — using only a few lines of code.

What Makes Nemotron 3 Nano Omni Different

Most multimodal agents today are built by bolting a vision model next to a speech model next to an LLM. Every extra inference pass adds latency, every cross-model handoff fragments context, and orchestration and error handling multiply over long-running workflows. Nemotron 3 Nano Omni replaces that approach with a single unified model that sees, hears, reads, and reasons across modalities in one loop.

The model combines unified vision and audio encoders with a hybrid Mixture of Experts (MoE) and Mamba-Transformer backbone — the same architectural foundation as Nemotron 3 Nano, extended to natively understand and reason across images, video, audio, and text, with text output.

On top of this architecture, 3D convolution layers and Efficient Video Sampling (EVS) keep video reasoning cheap across long clips, and a hybrid MoE design activates ~3B of 30B parameters per token — giving Nemotron 3 Nano Omni inference economics closer to a small dense model while holding quality closer to a much larger one.
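As a rough illustration of the arithmetic behind that claim, the sketch below computes the share of parameters active per token from the rounded figures quoted above (~3B of 30B). The function name `active_fraction` is hypothetical, and the inputs are the approximate numbers from this post, not exact parameter counts.

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Return the share of model parameters used for each token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# Rounded figures from the post: ~3B active out of 30B total (assumed, not exact).
fraction = active_fraction(3e9, 30e9)
print(f"Active per token: {fraction:.0%}")  # prints "Active per token: 10%"
```

In other words, each token touches roughly a tenth of the model's weights, which is where the comparison to a small dense model comes from.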

These design choices enable:

Strong multimodal understanding across OCR, vision, audio, and combined audio-video workloads
Efficient video reasoning through temporal-aware perception and optimized video sampling
Higher system efficiency and scalability: ~9x for video and ~7x for multi-document use cases
Simplified multimodal pipelines, with a single unified model

The 256K-token context window is a core part of the model's design. For multimodal agents, this means holding long screen sessions, multi-hour calls, and mixed-media documents in a single reasoning frame — without dropping critical context mid-task.

Nemotron 3 Nano Omni delivers leading scores across a wide range of multimodal benchmarks, including MathVista, Video-MME, OCRv2, CharXiv, ScreenSpot-Pro, MMLongBench-Doc, WorldSense, Daily Omni, MMAU, and VoiceBench. For more details, check out the NVIDIA technical blog.

Like the rest of the Nemotron 3 family, Nemotron 3 Nano Omni is fully open with access to model weights, training datasets, and development recipes. This transparency enables teams to inspect, customize, and fine-tune the model for their domain-specific use cases such as computer-use agents, document intelligence, or multimodal reasoning.

Getting Started on DeepInfra

Nemotron 3 Nano Omni is accessible via DeepInfra's OpenAI-compatible API. You can get started in a few lines of code.

Install the client:

pip install openai

Run your first inference (image input):

from openai import OpenAI

# Point the OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Send a text prompt and an image URL together in a single user message.
response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "system", "content": "You are a helpful perception agent."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI state in this screenshot and suggest the next action for an automation agent."},
                {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
            ],
        },
    ],
)

# The model replies with text.
print(response.choices[0].message.content)

Stream responses (text output):

# Reuses the client created in the previous example.
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to extract structured data from an invoice PDF."}
    ],
    stream=True,
)

# Print each text delta as it arrives; skip chunks that carry no choices.
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Streaming applies to response generation (text output). Audio, video, image, and document inputs are supported through the same chat completions endpoint — see the model documentation for the full request schema.
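For a local screenshot rather than a hosted URL, a common pattern with OpenAI-compatible APIs is to inline the image as a base64 data URL. The sketch below only builds the message payload and makes no network call. The helper name `image_message` is hypothetical, and whether this model's endpoint accepts data URLs is an assumption to verify against the model documentation.

```python
import base64
import mimetypes
from pathlib import Path


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with a text prompt and an inlined local image.

    Hypothetical helper: assumes the endpoint accepts base64 data URLs
    in the `image_url` field, as OpenAI-compatible APIs commonly do.
    """
    # Infer the MIME type from the file extension (e.g. ".png" -> "image/png").
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")

    # Encode the raw bytes so they can travel inside the JSON request body.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")

    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The returned dictionary can be placed in the `messages` list of a `client.chat.completions.create(...)` call in place of the URL-based user message shown earlier.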

Enterprise-Grade Security and Privacy

DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our DeepInfra Trust Center.

Start Building

Visit the Nemotron 3 Nano Omni model page on DeepInfra to explore pricing and start inference instantly. Check out our documentation to learn more about the broader model ecosystem and developer resources.

Have questions or need help? Reach out at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.
