We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Browse deepinfra models:

All categories and models you can try out and directly use in deepinfra:

text-generation

automatic-speech-recognition

zero-shot-image-classification

text-generation

Llama-Guard-4-12B

meta-llama/Llama-Guard-4-12B cover image

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

$0.18 / 1M tokens

text-generation

Meta-Llama-3.1-70B-Instruct-Turbo

meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo cover image

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes

$0.40 / 1M tokens

text-generation

Meta-Llama-3.1-8B-Instruct-Turbo

meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo cover image

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes

$0.02 in, $0.03 out / 1M

text-generation

microsoft/phi-4 cover image

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

$0.07 in, $0.14 out / 1M

text-generation

Mistral-Nemo-Instruct-2407

mistralai/Mistral-Nemo-Instruct-2407 cover image

12B model trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size.

$0.02 in, $0.04 out / 1M

text-generation

Mistral-Small-24B-Instruct-2501

mistralai/Mistral-Small-24B-Instruct-2501 cover image

Mistral Small 3 is a 24B-parameter language model optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.

$0.05 in, $0.08 out / 1M

text-generation

Mistral-Small-3.2-24B-Instruct-2506

mistralai/Mistral-Small-3.2-24B-Instruct-2506 cover image

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface—while otherwise matching or slightly improving on all previous text and vision benchmarks.

$0.075 in, $0.20 out / 1M

speech-recognition

Voxtral-Mini-3B-2507

mistralai/Voxtral-Mini-3B-2507 cover image

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

$0.00100 / minute

speech-recognition

Voxtral-Small-24B-2507

mistralai/Voxtral-Small-24B-2507 cover image

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

$0.00300 / minute

nvidia/Cosmos3-Nano cover image

Cosmos3 is a world foundation model that unifies understanding and generation within a single Mixture-of-Transformer (MoT) architecture. Two tightly coupled towers—a Reasoner (vision-language model) and a Generator (world simulator)—share latent representations so that structured perception directly grounds realistic, temporally consistent simulation.

$0.0108 / second (480p)

nvidia/Cosmos3-Super cover image

Cosmos3 is a world foundation model that unifies understanding and generation within a single Mixture-of-Transformer (MoT) architecture. Two tightly coupled towers—a Reasoner (vision-language model) and a Generator (world simulator)—share latent representations so that structured perception directly grounds realistic, temporally consistent simulation.

$0.0432 / second (480p)

text-generation

Nemotron-3-Nano-30B-A3B

nvidia/Nemotron-3-Nano-30B-A3B cover image

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

$0.05 in, $0.20 out / 1M

speech-recognition

Nemotron-3.5-ASR-Streaming-Multilingual-0.6b

nvidia/Nemotron-3.5-ASR-Streaming-Multilingual-0.6b cover image

Nemotron 3.5 ASR Streaming Multilingual is an open 0.6B-parameter prompt-conditioned cache-aware FastConformer-RNNT model, engineered for low-latency streaming transcription across 40+ languages. It powers real-time captioning, voice agents, and multilingual transcription pipelines—replacing separate per-language Whisper deployments with a single inference pass.

$0.00020 / minute

text-generation

Nemotron-Content-Safety-3.5

nvidia/Nemotron-Content-Safety-3.5 cover image

Nemotron Content Safety 3.5 is a multimodal safety classifier developed by NVIDIA. A compact safety model that handles text, images, and custom policies. It outputs a safe/unsafe classification plus a reasoning trace, and can be used as an inference-time guardrail, as a judge for LLM safety testing and evaluation, or with the accompanying training dataset to post-train models for safer behavior.

$0.20 / 1M tokens

llama-nemotron-embed-vl-1b-v2

nvidia/llama-nemotron-embed-vl-1b-v2 cover image

The llama-nemotron-embed-vl-1b-v2 is a high-performance multimodal embedding model designed to transform text queries and document images into dense vector representations for advanced retrieval systems. It excels at understanding complex visual content like charts, tables, and infographics.

$0.010 / 1M tokens

llama-nemotron-rerank-vl-1b-v2

nvidia/llama-nemotron-rerank-vl-1b-v2 cover image

The llama-nemotron-rerank-vl-1b-v2 is a 1.7B parameter multimodal reranking model designed to evaluate and order the relevance of document images and text against specific user queries. It excels at understanding complex visual content like charts, tables, and infographics.

$0.010 / 1M tokens

image-classification

clip-vit-base-patch32

openai/clip-vit-base-patch32 cover image

The CLIP model was developed by OpenAI to investigate the robustness of computer vision models. It uses a Vision Transformer architecture and was trained on a large dataset of image-caption pairs. The model shows promise in various computer vision tasks but also has limitations, including difficulties with fine-grained classification and potential biases in certain applications.

$0.0005 / second

image-classification

clip-vit-large-patch14-336

openai/clip-vit-large-patch14-336 cover image

A zero-shot-image-classification model released by OpenAI. The clip-vit-large-patch14-336 model was trained from scratch on an unknown dataset and achieves unspecified results on the evaluation set. The model's intended uses and limitations, as well as its training and evaluation data, are not provided. The training procedure used an unknown optimizer and precision, and the framework versions included Transformers 4.21.3, TensorFlow 2.8.2, and Tokenizers 0.12.1.

$0.0005 / second

text-generation

openai/gpt-oss-120b cover image

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

$0.037 in, $0.17 out / 1M

text-generation

gpt-oss-120b-Turbo

openai/gpt-oss-120b-Turbo cover image

$0.15 in, $0.60 out / 1M

text-generation

openai/gpt-oss-20b cover image

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

$0.03 in, $0.14 out / 1M

speech-recognition

whisper-large-v3

openai/whisper-large-v3 cover image

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

$0.00045 / minute

speech-recognition

whisper-large-v3-turbo

openai/whisper-large-v3-turbo cover image

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.

$0.00020 / minute

sentence-transformers/

all-MiniLM-L12-v2

sentence-transformers/all-MiniLM-L12-v2 cover image

We present a sentence transformation model that generates semantically similar sentences. Our model is based on the Sentence-Transformers architecture and was trained on a large dataset of sentence pairs. We evaluate the effectiveness of our model by measuring its ability to generate similar sentences that are close to the original sentence in meaning.

$0.005 / 1M tokens

SOC 2 Certified

ISO 27001 Certified

Have questions or need a custom solution?

Company

Latest Models

google/gemma-4-E4B-it tencent/Hy3 anthropic/claude-fable-5 anthropic/claude-sonnet-5 MiniMaxAI/MiniMax-M3

Featured Models

Qwen/Qwen3.6-35B-A3B Qwen/Qwen3-Max black-forest-labs/FLUX-2-klein-4b XiaomiMiMo/MiMo-V2.5 zai-org/GLM-4.7-Flash

Built With Love in Palo Alto

© 2026 DeepInfra. All rights reserved.

Privacy Policy Terms of Service