GLM-5.1 - state-of-the-art agentic engineering, now available on DeepInfra!

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano, the newest open reasoning model in the Nemotron family. Our goal is to give developers, researchers, and teams the fastest and simplest path to using Nemotron 3 Nano from day one — whether you are building lightweight agents, real-time analytics pipelines, or production-grade reasoning systems. On DeepInfra, Nano runs with zero setup, low latency, and no operational overhead, enabling you to move from idea to deployment in minutes.
With its balance of speed, accuracy, and predictable cost, 3 Nano is designed for real-world reasoning tasks. When paired with DeepInfra's high-efficiency inference platform and usage-based pricing, you can experiment freely, scale seamlessly, and integrate the model into your production workflows using only a few lines of code.
Nemotron 3 Nano introduces a hybrid architecture that blends Mixture of Experts (MoE) with the efficient Mamba transformer design. Most layers rely on Mamba for high-throughput sequence processing, while a focused subset of expert layers handles heavier reasoning operations. This enables:
To strengthen its reasoning capabilities, 3 Nano is trained on NVIDIA-curated synthetic reasoning datasets generated from expert models and aligned using reinforcement-learning methods to encourage more human-like thought patterns. Benchmarks results and third-party analysis confirm strong performance across:
Benchmark data shown below is based on independent evaluations by Artificial Analysis and is included for reference.


Source: Artificial Analysis
A key design principle of the Nemotron family including this model is openness: the weights, training data, and training recipes are available to the community. Teams can inspect, customize, or fine tune the model to fit research, product, or enterprise needs. This transparency aligns well with DeepInfra's mission to provide a predictable, developer-centric platform for running high-quality open models.
Nemotron 3 Nano supports a wide range of deployments—local hardware, cloud platforms, or NVIDIA NIM-based setups. On DeepInfra, the model is available through a fully managed endpoint, giving developers immediate access without navigating infrastructure provisioning or configuration.
Developers can expect:
To explore Nano's capabilities, use our ready-to-use Jupyter notebook. It's the fastest way to get started with working examples you can run immediately.
A hands-on guide showing how to run Nano, tune reasoning parameters, use long-context inputs, and build lightweight agentic workflows.
The nemotron-3-nano-tutorial.ipynb notebook walks through:
The notebook includes working code snippets you can copy and use immediately.
DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our Trust Center.
Visit the Nemotron 3 Nano model page on DeepInfra to explore pricing and start inference instantly, or check out our documentation to learn more about the broader model ecosystem and developer resources.
Have questions or need help? Reach out to us at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) - we're happy to help.
MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost<p>About MiniMax-M2.5 MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through […]</p>
Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost<p>About Qwen3 Coder 480B A35B Instruct Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters and 35 billion active parameters per inference, enabling high performance […]</p>
Qwen3.5 9B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 9B Qwen3.5 9B is the flagship of Alibaba’s Qwen3.5 Small Model Series, released on March 2, 2026. It is a dense multimodal model combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference compared to traditional dense architectures. The architecture utilizes […]</p>
© 2026 Deep Infra. All rights reserved.