

Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

text-generation

automatic-speech-recognition

zero-shot-image-classification

text-generation

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

The Llama 4 models are natively multimodal AI models that enable text and multimodal experiences. They use a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17 billion parameter model with 128 experts.

$0.20 in, $0.80 out / 1M

text-generation

meta-llama/Llama-4-Scout-17B-16E-Instruct

The Llama 4 models are natively multimodal AI models that enable text and multimodal experiences. They use a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17 billion parameter model with 16 experts.

$0.10 in, $0.30 out / 1M

text-generation

meta-llama/Llama-Guard-4-12B

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters, trained jointly on text and multiple images. It is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Like previous versions, it can classify content in both LLM inputs (prompt classification) and LLM responses (response classification). It acts as an LLM itself: it generates text that indicates whether a given prompt or response is safe or unsafe and, if unsafe, lists the content categories violated.
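Because the classifier returns its verdict as generated text, a caller has to parse that text before acting on it. The following is a minimal sketch, not the model's documented interface. It assumes the first non-empty line of the output is either "safe" or "unsafe", and that any following lines hold comma-separated category codes. The function name parse_guard_output, the GuardVerdict type, and codes such as "S1" are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple


class GuardVerdict(NamedTuple):
    """Parsed classifier result (hypothetical container for this sketch)."""
    safe: bool
    categories: List[str]


def parse_guard_output(text: str) -> GuardVerdict:
    """Parse a safe/unsafe verdict from generated text.

    Assumed layout (an assumption, not taken from the listing):
      line 1:  "safe" or "unsafe"
      line 2+: comma-separated category codes, only when unsafe
    """
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    label = lines[0].lower()
    if label == "safe":
        return GuardVerdict(True, [])
    if label == "unsafe":
        categories: List[str] = []
        for ln in lines[1:]:
            categories.extend(c.strip() for c in ln.split(",") if c.strip())
        return GuardVerdict(False, categories)

    # Anything else is treated as unparseable rather than silently "safe".
    raise ValueError(f"unrecognized verdict: {lines[0]!r}")


if __name__ == "__main__":
    print(parse_guard_output("safe"))
    print(parse_guard_output("unsafe\nS1,S10"))
```

Raising on unrecognized text, rather than defaulting to safe, keeps a malformed response from passing the guardrail unnoticed.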

$0.18 / 1M tokens

text-generation

meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

$0.40 / 1M tokens

text-generation

meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

$0.02 in, $0.04 out / 1M

text-generation

microsoft/phi-4

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

$0.07 in, $0.14 out / 1M

text-generation

mistralai/Mistral-Nemo-Instruct-2407

A 12B model trained jointly by Mistral AI and NVIDIA. It significantly outperforms existing models of smaller or similar size.

$0.019 in, $0.03 out / 1M

text-generation

mistralai/Mistral-Small-24B-Instruct-2501

Mistral Small 3 is a 24B-parameter language model optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.

$0.05 in, $0.08 out / 1M

text-generation

mistralai/Mistral-Small-3.2-24B-Instruct-2506

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface, while otherwise matching or slightly improving on all previous text and vision benchmarks.

$0.075 in, $0.20 out / 1M

text-generation

nvidia/Nemotron-3-Nano-30B-A3B

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

$0.05 in, $0.20 out / 1M

text-generation

nvidia/Nemotron-Content-Safety-3.5

Nemotron Content Safety 3.5 is a compact multimodal safety classifier developed by NVIDIA that handles text, images, and custom policies. It outputs a safe/unsafe classification plus a reasoning trace, and can be used as an inference-time guardrail, as a judge for LLM safety testing and evaluation, or with the accompanying training dataset to post-train models for safer behavior.

$0.20 / 1M tokens

text-generation

openai/gpt-oss-120b

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

$0.037 in, $0.17 out / 1M

text-generation

openai/gpt-oss-120b-Turbo

$0.15 in, $0.60 out / 1M

text-generation

openai/gpt-oss-20b

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

$0.03 in, $0.14 out / 1M

text-generation

stepfun-ai/Step-3.7-Flash

Step 3.7 Flash is an open-source multimodal reasoning model by StepFun with 198B total parameters (11B active) using Mixture of Experts. It accepts text and image inputs and features a 256K context window, selectable reasoning effort, tool calling, and agentic capabilities for coding and search workflows, scoring 80.9% on GPQA Diamond and 56.3% on SWE-bench Pro.

$0.04 cached, $0.20 in, $1.15 out / 1M

text-generation

tencent/Hy3

Hy3 is a 295B-parameter Mixture-of-Experts (MoE) model with 21B active parameters and 3.8B MTP layer parameters, developed by the Tencent Hy Team. Following the Hy3 Preview launch in late April, we gathered feedback from 50+ products and scaled up post-training with higher-quality data. Today, we introduce Hy3, which outperforms similar-size models and rivals flagship open-source models with 2-5x as many parameters. It also shows significant gains in utility across various products and productivity tasks.

$0.035 cached, $0.14 in, $0.58 out / 1M

text-generation

thinkingmachines/Inkling

Inkling is a general-purpose multimodal model that accepts text, image, and audio inputs and generates text outputs.

$0.17 cached, $1.00 in, $4.05 out / 1M

text-generation

zai-org/GLM-4.6

Compared with GLM-4.5, GLM-4.6 brings several key improvements. Longer context window: the context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks. Superior coding performance: the model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code, and Kilo Code, including improvements in generating visually polished front-end pages. Advanced reasoning: GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability. More capable agents: GLM-4.6 performs better in tool-use and search-based agents, and integrates more effectively within agent frameworks. Refined writing: the model aligns better with human preferences in style and readability, and performs more naturally in role-playing scenarios.

$0.10 cached, $0.50 in, $2.00 out / 1M

text-generation

zai-org/GLM-4.7

GLM-4.7 is a state-of-the-art, multilingual Mixture-of-Experts (MoE) language model designed for complex reasoning, agentic coding, and tool use. Building on its predecessor GLM-4.6, it delivers significant improvements across key benchmarks, including multilingual SWE-bench, Terminal Bench, and reasoning-heavy evaluations like HLE. The model features advanced "Interleaved Thinking" and new "Preserved Thinking" modes, allowing it to reason before actions and maintain consistency across long, multi-turn tasks. With 358 billion parameters, GLM-4.7 excels in generating clean code, modern UI elements, and sophisticated reasoning outputs.

$0.08 cached, $0.40 in, $1.75 out / 1M
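The price lines in the listings above are quoted in USD per 1M tokens, with separate rates for input ("in"), output ("out"), and, for some models, cached input ("cached"). As a rough illustration of the arithmetic, the sketch below estimates the cost of a single request. The Price class and estimate_cost function are hypothetical helpers, not part of any DeepInfra SDK, and the billing rule (cached tokens are a subset of input tokens charged at the cached rate) is an assumption that should be checked against the provider's pricing documentation.

```python
from dataclasses import dataclass
from typing import Optional

TOKENS_PER_PRICE_UNIT = 1_000_000  # listings quote prices per 1M tokens


@dataclass(frozen=True)
class Price:
    """Per-1M-token rates in USD, as read from a listing's price line."""
    input_per_m: float
    output_per_m: float
    cached_per_m: Optional[float] = None  # None when no cached rate is listed


def estimate_cost(price: Price, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_tokens counts the part of input_tokens served from
    cache, billed at the cached rate instead of the input rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")

    # Fall back to the normal input rate when no cached rate is listed.
    cached_rate = (price.cached_per_m
                   if price.cached_per_m is not None
                   else price.input_per_m)
    uncached_tokens = input_tokens - cached_tokens

    total = (uncached_tokens * price.input_per_m
             + cached_tokens * cached_rate
             + output_tokens * price.output_per_m)
    return total / TOKENS_PER_PRICE_UNIT


if __name__ == "__main__":
    # Rates copied from the zai-org/GLM-4.7 price line in the listing.
    glm_4_7 = Price(input_per_m=0.40, output_per_m=1.75, cached_per_m=0.08)
    cost = estimate_cost(glm_4_7, input_tokens=20_000, output_tokens=2_000,
                         cached_tokens=15_000)
    print(f"estimated cost: ${cost:.6f}")
```

Output tokens are priced several times higher than input tokens for most models listed, so long generations tend to dominate the bill even when prompts are large.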


© 2026 DeepInfra. All rights reserved.
