Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest generation features Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by activating only a subset of parameters per token. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens in Llama 4 Scout), Llama excels at tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference for low-latency, high-throughput production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama suits developers seeking powerful, customizable AI with minimal overhead.
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models use a mixture-of-experts architecture to deliver industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts.
Price per 1M input tokens: $0.08
Price per 1M output tokens: $0.30
Release Date: 04/05/2025
Context Size: 327,680 tokens
Quantization: bfloat16
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
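The usage figures returned with each response can be turned into a dollar cost using the prices listed above. A minimal sketch, assuming Scout's listed rates; actual billing is whatever DeepInfra reports:

```python
# Llama 4 Scout prices from the listing above, converted to $ per token
INPUT_PRICE = 0.08 / 1_000_000   # $0.08 per 1M input tokens
OUTPUT_PRICE = 0.30 / 1_000_000  # $0.30 per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request given its reported usage."""
    return prompt_tokens * INPUT_PRICE + completion_tokens * OUTPUT_PRICE

# Usage from the example response above: 11 prompt tokens, 25 completion tokens
cost = request_cost(11, 25)
print(f"${cost:.8f}")  # $0.00000838
```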
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models use a mixture-of-experts architecture to deliver industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts.
Price per 1M input tokens: $0.50
Price per 1M output tokens: $0.50
Release Date: 05/16/2025
Context Size: 8,192 tokens
Quantization: fp8
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-4-Scout-17B-16E | 320k | $0.08 | $0.30
Llama-4-Maverick-17B-128E | 1024k | $0.15 | $0.60
Llama-4-Maverick-17B-128E-Turbo | 8k | $0.50 | $0.50
Llama-Guard-4-12B | 160k | $0.18 | $0.18
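Given the rates in the Llama 4 table above, the monthly cost of a workload can be estimated directly from its token volume. A minimal sketch; the prices are copied from the table and the 100M/20M workload is a hypothetical example:

```python
# Llama 4 prices from the table above: ($ per 1M input, $ per 1M output)
PRICES = {
    "Llama-4-Scout-17B-16E": (0.08, 0.30),
    "Llama-4-Maverick-17B-128E": (0.15, 0.60),
    "Llama-4-Maverick-17B-128E-Turbo": (0.50, 0.50),
    "Llama-Guard-4-12B": (0.18, 0.18),
}

def workload_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Hypothetical workload: 100M input + 20M output tokens per month
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 100, 20):.2f}")
```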
Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-3.3-70B-Instruct | 128k | $0.23 | $0.40
Llama-3.3-70B-Instruct-Turbo | 128k | $0.038 | $0.12
Llama-3.2-11B-Vision-Instruct | 128k | $0.049 | $0.049
Llama-3.2-3B-Instruct | 128k | $0.012 | $0.024
Llama-3.2-1B-Instruct | 128k | $0.005 | $0.01
Meta-Llama-3.1-405B-Instruct | 32k | $0.80 | $0.80
Meta-Llama-3.1-70B-Instruct | 128k | $0.23 | $0.40
Meta-Llama-3.1-70B-Instruct-Turbo | 128k | $0.10 | $0.28
Meta-Llama-3.1-8B-Instruct | 128k | $0.03 | $0.05
Meta-Llama-3.1-8B-Instruct-Turbo | 128k | $0.015 | $0.02
Meta-Llama-3-70B-Instruct | 8k | $0.30 | $0.40
Meta-Llama-3-8B-Instruct | 8k | $0.03 | $0.06
Llama is Meta's family of open-weight foundational language models, spanning text, chat, code, and multimodal (image + text) understanding. The latest generation, Llama 4, delivers state-of-the-art performance with an efficient, scalable MoE architecture and context windows ranging from hundreds of thousands to millions of tokens. Meta provides extensive documentation, responsible-use guidelines, and a developer ecosystem (cookbooks, tutorials, and a "Llama Everywhere" deployment guide) to support a wide range of use cases.
Llama models are powerful general-purpose LLMs, well suited to natural language generation, multilingual dialogue, programming assistance, document summarization, and image-language tasks. They also excel in enterprise applications such as search augmentation, AI copilots, and automated reasoning systems.
DeepInfra deploys Llama models on high-performance GPUs (A100, H100, B200) with regional autoscaling, ensuring low latency and high throughput. This makes DeepInfra suitable for real-time applications where responsiveness and uptime are critical.
Llama models on DeepInfra support context windows ranging from 8k to over 1M tokens, depending on the model variant. This makes them well suited to processing long documents, handling large conversation histories, or powering advanced RAG systems without truncation.
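Before sending a long document, it is worth checking that it fits the chosen model's context window. A minimal sketch using the context sizes from the tables above and a rough ~4-characters-per-token heuristic (an assumption; exact counts require the model's actual tokenizer):

```python
# Context sizes (tokens) taken from the listing above
CONTEXT_SIZES = {
    "Llama-4-Scout-17B-16E": 327_680,
    "Llama-3.3-70B-Instruct": 128_000,
    "Meta-Llama-3-8B-Instruct": 8_192,
}

def fits_context(model: str, text: str, reserve_for_output: int = 1024) -> bool:
    """Rough check: does `text` plus an output budget fit the model's window?

    Uses a ~4 chars/token approximation, NOT a real tokenizer.
    """
    approx_tokens = len(text) // 4
    return approx_tokens + reserve_for_output <= CONTEXT_SIZES[model]

doc = "word " * 20_000  # ~100k characters, roughly 25k tokens
print(fits_context("Meta-Llama-3-8B-Instruct", doc))  # False
print(fits_context("Llama-3.3-70B-Instruct", doc))    # True
```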
Several Llama models on DeepInfra include multimodal support, allowing them to process images and text in the same request. This enables use cases like visual Q&A, document intelligence, and screenshot analysis, all via a single API endpoint.
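For vision-capable models, the OpenAI-compatible chat format represents a multimodal turn as a list of typed content parts instead of a plain string. A minimal sketch of building such a message, assuming the endpoint accepts the standard chat-completions vision schema (the URL is a placeholder):

```python
# Build an OpenAI-style multimodal user message: content is a list of
# typed parts, here one text part and one image_url part.
def build_vision_message(question: str, image_url: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message("What is in this image?", "https://example.com/photo.jpg")
print([part["type"] for part in msg["content"]])  # ['text', 'image_url']
```

A message built this way can be passed in `messages=[...]` to a vision-capable variant such as Llama-3.2-11B-Vision-Instruct, using the same client setup as the examples above.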
© 2025 Deep Infra. All rights reserved.