Browse deepinfra models:

All categories and models you can try out and directly use in deepinfra:

text-generation

automatic-speech-recognition

zero-shot-image-classification

black-forest-labs/FLUX.1-Kontext-dev

FLUX.1 Kontext [dev] is a 12-billion-parameter image editing model that transforms visuals based on natural language instructions. It allows highly consistent, multi-step edits and is released with open weights under a non-commercial license to empower artists and researchers.

$0.01 x (width / 1024) x (height / 1024) x (iters / 25)
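As a rough illustration of the per-image formula above, the following sketch computes the price of one request. The function name and the `base_usd` parameter are hypothetical helpers for this example, not part of any DeepInfra API; only the formula itself comes from the listing.

```python
def flux_kontext_cost(width: int, height: int, iters: int,
                      base_usd: float = 0.01) -> float:
    """Estimate the price of one image in USD.

    Implements the listed formula:
    $0.01 x (width / 1024) x (height / 1024) x (iters / 25)
    """
    if width <= 0 or height <= 0 or iters <= 0:
        raise ValueError("width, height, and iters must be positive")
    return base_usd * (width / 1024) * (height / 1024) * (iters / 25)


if __name__ == "__main__":
    # A 1024x1024 image at 25 iterations costs the base price.
    print(f"${flux_kontext_cost(1024, 1024, 25):.4f}")
    # Doubling the width and the iteration count quadruples the price.
    print(f"${flux_kontext_cost(2048, 1024, 50):.4f}")
```

Under this formula the price scales linearly with pixel count and with the number of iterations, so halving either one halves the cost.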

text-generation

deepseek-ai/DeepSeek-R1

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

$0.70 in, $2.40 out / 1M
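Prices written as "in / out per 1M" are charged separately for input (prompt) and output (completion) tokens. A minimal sketch of the arithmetic, assuming billing is strictly proportional to token counts; the function name is hypothetical and the token counts are made-up example values:

```python
def token_cost(input_tokens: int, output_tokens: int,
               usd_in_per_million: float, usd_out_per_million: float) -> float:
    """Estimate request cost in USD from per-million-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * usd_in_per_million
             + output_tokens * usd_out_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    # Listed rates above: $0.70 in, $2.40 out per 1M tokens.
    cost = token_cost(200_000, 50_000, 0.70, 2.40)
    print(f"${cost:.2f}")
```

For entries that list a single price per 1M tokens, the same helper applies with both rates set to that price.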

text-generation

deepseek-ai/DeepSeek-R1-Turbo

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

$1.00 in, $3.00 out / 1M

deepseek-ai/Janus-Pro-1B

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.

$0.0005 / image

deepseek-ai/Janus-Pro-7B

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.

google/embeddinggemma-300m

EmbeddingGemma is a 300M-parameter multilingual open embedding model from Google DeepMind, designed for efficient deployment even on low-resource devices. It produces high-quality text vector representations for tasks such as search, classification, clustering, and semantic similarity.

$0.002 / 1M tokens

text-generation

google/gemini-1.5-flash

Gemini 1.5 Flash is Google's foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter.

text-generation

google/gemini-1.5-flash-8b

text-generation

google/gemini-2.0-flash-001

$0.10 in, $0.40 out / 1M

google/veo-3.0

Veo 3 is a state-of-the-art text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

google/veo-3.0-fast

Veo 3 Fast is a speed-optimized version of the Veo 3 model, designed for rapid video creation. While maintaining high quality, it delivers results in a fraction of the time, making it ideal for quick iterations and dynamic content generation.

google/veo-3.1

Veo 3.1 is the latest text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

google/veo-3.1-fast

Veo 3.1 is the latest text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

intfloat/e5-base-v2

Text Embeddings by Weakly-Supervised Contrastive Pre-training. The model has 12 layers and an embedding size of 768.

$0.005 / 1M tokens

intfloat/e5-large-v2

Text Embeddings by Weakly-Supervised Contrastive Pre-training. The model has 24 layers and an embedding size of 1024.

$0.010 / 1M tokens

intfloat/multilingual-e5-large

The Multilingual-E5-large model is a 24-layer text embedding model with an embedding size of 1024, trained on a mixture of multilingual datasets and supporting 100 languages.

$0.010 / 1M tokens

intfloat/multilingual-e5-large-instruct

The Multilingual-E5 models, initialized from XLM-RoBERTa, support up to 512 tokens per input; any longer text is silently truncated. For best performance, always prefix inputs with “query:” or “passage:”, as the model was explicitly trained with this format.
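The prefixing convention described above can be sketched as a small preprocessing step applied before texts are sent for embedding. The helper name `with_e5_prefix` is hypothetical, and the exact template expected by a given variant should be checked against its model card, so treat this as illustrative rather than as the model's official usage.

```python
from typing import List

# Roles named in the description above; anything else is rejected.
ALLOWED_ROLES = ("query", "passage")


def with_e5_prefix(texts: List[str], role: str) -> List[str]:
    """Return copies of `texts` with an E5-style role prefix prepended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"role must be one of {ALLOWED_ROLES}, got {role!r}")
    return [f"{role}: {text.strip()}" for text in texts]


if __name__ == "__main__":
    queries = with_e5_prefix(["how tall is Mount Everest?"], "query")
    passages = with_e5_prefix(
        ["Mount Everest is Earth's highest mountain above sea level."],
        "passage",
    )
    print(queries[0])
    print(passages[0])
```

Because inputs beyond 512 tokens are truncated silently, long documents are usually split into shorter passages before prefixing; the splitting strategy is left out of this sketch.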

$0.0005 / second

text-generation

meta-llama/Llama-3.2-11B-Vision-Instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

$0.049 / 1M tokens

text-generation

meta-llama/Llama-3.2-3B-Instruct

The Meta Llama 3.2 collection of multilingual large language models (LLMs) comprises pretrained and instruction-tuned generative models in 1B and 3B sizes (text in, text out).

$0.02 / 1M tokens

text-generation

meta-llama/Llama-Guard-3-8B

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

text-generation

meta-llama/Llama-Guard-4-12B

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

text-generation

meta-llama/Meta-Llama-3-8B-Instruct

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

$0.03 in, $0.06 out / 1M

text-generation

meta-llama/Meta-Llama-3.1-70B-Instruct

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

$0.40 / 1M tokens

text-generation

meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

$0.40 / 1M tokens


© 2025 Deep Infra. All rights reserved.
