Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest generation features Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by activating only a subset of parameters per token. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens in Llama 4 Scout), Llama excels at tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference for low-latency, high-throughput production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama suits developers seeking powerful, customizable AI with minimal overhead.
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models use a mixture-of-experts architecture to deliver industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts.
Price per 1M input tokens: $0.08
Price per 1M output tokens: $0.30
Release Date: 04/05/2025
Context Size: 327,680 tokens
Quantization: bfloat16
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
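The usage figures returned with each response can be turned into a dollar cost using the prices listed above. A minimal sketch, assuming Scout's listed rates; actual billing is whatever DeepInfra reports:

```python
# Llama 4 Scout prices from the listing above, converted to $ per token
INPUT_PRICE = 0.08 / 1_000_000   # $0.08 per 1M input tokens
OUTPUT_PRICE = 0.30 / 1_000_000  # $0.30 per 1M output tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request given its reported usage."""
    return prompt_tokens * INPUT_PRICE + completion_tokens * OUTPUT_PRICE

# Usage from the example response above: 11 prompt tokens, 25 completion tokens
cost = request_cost(11, 25)
print(f"${cost:.8f}")  # $0.00000838
```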
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models use a mixture-of-experts architecture to deliver industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts.
Price per 1M input tokens: $0.50
Price per 1M output tokens: $0.50
Release Date: 05/16/2025
Context Size: 8,192 tokens
Quantization: fp8
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-4-Scout-17B-16E | 320k | $0.08 | $0.30
Llama-4-Maverick-17B-128E | 1024k | $0.15 | $0.60
Llama-4-Maverick-17B-128E-Turbo | 8k | $0.50 | $0.50
Llama-Guard-4-12B | 160k | $0.18 | $0.18
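Given the rates in the Llama 4 table above, the monthly cost of a workload can be estimated directly from its token volume. A minimal sketch; the prices are copied from the table and the 100M/20M workload is a hypothetical example:

```python
# Llama 4 prices from the table above: ($ per 1M input, $ per 1M output)
PRICES = {
    "Llama-4-Scout-17B-16E": (0.08, 0.30),
    "Llama-4-Maverick-17B-128E": (0.15, 0.60),
    "Llama-4-Maverick-17B-128E-Turbo": (0.50, 0.50),
    "Llama-Guard-4-12B": (0.18, 0.18),
}

def workload_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Hypothetical workload: 100M input + 20M output tokens per month
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 100, 20):.2f}")
```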
Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-3.3-70B-Instruct | 128k | $0.23 | $0.40
Llama-3.3-70B-Instruct-Turbo | 128k | $0.038 | $0.12
Llama-3.2-11B-Vision-Instruct | 128k | $0.049 | $0.049
Llama-3.2-3B-Instruct | 128k | $0.012 | $0.024
Llama-3.2-1B-Instruct | 128k | $0.005 | $0.01
Meta-Llama-3.1-405B-Instruct | 32k | $0.80 | $0.80
Meta-Llama-3.1-70B-Instruct | 128k | $0.23 | $0.40
Meta-Llama-3.1-70B-Instruct-Turbo | 128k | $0.10 | $0.28
Meta-Llama-3.1-8B-Instruct | 128k | $0.03 | $0.05
Meta-Llama-3.1-8B-Instruct-Turbo | 128k | $0.015 | $0.02
Meta-Llama-3-70B-Instruct | 8k | $0.30 | $0.40
Meta-Llama-3-8B-Instruct | 8k | $0.03 | $0.06
Llama is Meta's family of open-weight foundational language models, spanning text, chat, code, and multimodal (image + text) understanding. The latest generation, Llama 4, delivers state-of-the-art performance with an efficient, scalable MoE architecture and context windows ranging from hundreds of thousands to millions of tokens. Meta provides extensive documentation, responsible-use guidelines, and a developer ecosystem (cookbooks, tutorials, and a "Llama Everywhere" deployment guide) to support a wide range of use cases.
Llama models are powerful general-purpose LLMs, well suited to natural language generation, multilingual dialogue, programming assistance, document summarization, and image-language tasks. They also excel in enterprise applications such as search augmentation, AI copilots, and automated reasoning systems.
DeepInfra deploys Llama models on high-performance GPUs (A100, H100, B200) with regional autoscaling, ensuring low latency and high throughput. This makes DeepInfra suitable for real-time applications where responsiveness and uptime are critical.
Llama models on DeepInfra support context windows ranging from 8k to over 1M tokens, depending on the model variant. This makes them well suited to processing long documents, handling large conversation histories, or powering advanced RAG systems without truncation.
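Before sending a long document, it is worth checking that it fits the chosen model's context window. A minimal sketch using the context sizes from the tables above and a rough ~4-characters-per-token heuristic (an assumption; exact counts require the model's actual tokenizer):

```python
# Context sizes (tokens) taken from the listing above
CONTEXT_SIZES = {
    "Llama-4-Scout-17B-16E": 327_680,
    "Llama-3.3-70B-Instruct": 128_000,
    "Meta-Llama-3-8B-Instruct": 8_192,
}

def fits_context(model: str, text: str, reserve_for_output: int = 1024) -> bool:
    """Rough check: does `text` plus an output budget fit the model's window?

    Uses a ~4 chars/token approximation, NOT a real tokenizer.
    """
    approx_tokens = len(text) // 4
    return approx_tokens + reserve_for_output <= CONTEXT_SIZES[model]

doc = "word " * 20_000  # ~100k characters, roughly 25k tokens
print(fits_context("Meta-Llama-3-8B-Instruct", doc))  # False
print(fits_context("Llama-3.3-70B-Instruct", doc))    # True
```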
Several Llama models on DeepInfra include multimodal support, allowing them to process images and text in the same request. This enables use cases like visual Q&A, document intelligence, and screenshot analysis, all via a single API endpoint.
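For vision-capable models, the OpenAI-compatible chat format represents a multimodal turn as a list of typed content parts instead of a plain string. A minimal sketch of building such a message, assuming the endpoint accepts the standard chat-completions vision schema (the URL is a placeholder):

```python
# Build an OpenAI-style multimodal user message: content is a list of
# typed parts, here one text part and one image_url part.
def build_vision_message(question: str, image_url: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message("What is in this image?", "https://example.com/photo.jpg")
print([part["type"] for part in msg["content"]])  # ['text', 'image_url']
```

A message built this way can be passed in `messages=[...]` to a vision-capable variant such as Llama-3.2-11B-Vision-Instruct, using the same client setup as the examples above.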
© 2025 Deep Infra. All rights reserved.