DeepSeek's models are a suite of advanced AI systems that prioritize efficiency, scalability, and real-world applicability.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
DeepSeek-R1 | 160k | $0.45 | $2.15
DeepSeek-R1-0528 | 160k | $0.50 | $2.15
DeepSeek-R1-Turbo | 32k | $1.00 | $3.00
DeepSeek-V3-0324 | 160k | $0.30 | $0.88
DeepSeek-V3 | 160k | $0.38 | $0.89
DeepSeek-Prover-V2-671B | 160k | $0.50 | $2.18
DeepSeek-R1-Distill-Llama-70B | 128k | $0.10 | $0.40
DeepSeek-R1-Distill-Qwen-32B | 128k | $0.12 | $0.18
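All of the text-generation models listed here are billed per token, with separate input and output rates. As a minimal sketch of the arithmetic, using the DeepSeek-V3-0324 rates from the table above:

```python
# Per-token billing: input and output tokens are metered separately.
# Rates below are the DeepSeek-V3-0324 prices from the table ($ per 1M tokens).
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 0.88

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A request with a 2,000-token prompt and a 500-token completion:
print(f"${request_cost(2_000, 500):.6f}")  # $0.001040
```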
The Llama 4 collection is a set of natively multimodal AI models that enable text and multimodal experiences.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-4-Scout-17B-16E | 320k | $0.08 | $0.30
Llama-4-Maverick-17B-128E | 1024k | $0.15 | $0.60
Llama-4-Maverick-17B-128E-Turbo | 8k | $0.50 | $0.50
Llama-Guard-4-12B | 160k | $0.05 | $0.05
Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Llama-3.3-70B-Instruct | 128k | $0.23 | $0.40
Llama-3.3-70B-Instruct-Turbo | 128k | $0.05 | $0.25
Llama-3.2-11B-Vision-Instruct | 128k | $0.049 | $0.049
Llama-3.2-3B-Instruct | 128k | $0.01 | $0.02
Llama-3.2-1B-Instruct | 128k | $0.005 | $0.01
Meta-Llama-3.1-405B-Instruct | 32k | $0.80 | $0.80
Meta-Llama-3.1-70B-Instruct | 128k | $0.23 | $0.40
Meta-Llama-3.1-70B-Instruct-Turbo | 128k | $0.10 | $0.28
Meta-Llama-3.1-8B-Instruct | 128k | $0.03 | $0.05
Meta-Llama-3.1-8B-Instruct-Turbo | 128k | $0.016 | $0.03
Meta-Llama-3-70B-Instruct | 8k | $0.30 | $0.40
Meta-Llama-3-8B-Instruct | 8k | $0.03 | $0.06
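For illustration, here is a minimal sketch of querying one of the hosted models above through an OpenAI-compatible client; the base URL and the exact model identifier are assumptions, so check your account dashboard for the canonical values:

```python
# Minimal sketch: querying a hosted Llama model through an
# OpenAI-compatible endpoint. The base_url and model ID below are
# assumptions -- verify both against your provider dashboard.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # placeholder
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize per-token pricing in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
print(resp.usage)  # prompt_tokens / completion_tokens drive the billing above
```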
The Qwen series offers a comprehensive suite of dense and mixture-of-experts models.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
QwQ-32B | 128k | $0.15 | $0.20
Qwen3-235B-A22B | 40k | $0.13 | $0.60
Qwen3-32B | 40k | $0.10 | $0.30
Qwen3-30B-A3B | 40k | $0.08 | $0.29
Qwen3-14B | 40k | $0.06 | $0.24
Qwen2.5-72B-Instruct | 32k | $0.12 | $0.39
Qwen2.5-Coder-32B-Instruct | 32k | $0.06 | $0.15
Qwen2.5-7B-Instruct | 32k | $0.04 | $0.10
Gemma is a family of lightweight, state-of-the-art open models from Google.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
gemma-3-27b-it | 128k | $0.10 | $0.20
gemma-3-12b-it | 128k | $0.05 | $0.10
gemma-3-4b-it | 128k | $0.02 | $0.04
Phi models offer cost-effective, high-performance AI solutions.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
phi-4 | 16k | $0.07 | $0.14
Phi-4-multimodal-instruct | 128k | $0.05 | $0.10
phi-4-reasoning-plus | 32k | $0.07 | $0.35
Mixture-of-experts models split computation across multiple expert subnetworks, providing strong performance.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Mixtral-8x7B-Instruct-v0.1 | 32k | $0.08 | $0.24
WizardLM-2-8x22B | 64k | $0.48 | $0.48
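To make "expert subnetworks" concrete, here is a toy sketch of top-2 gating in the spirit of Mixtral-style MoE layers; the dimensions and weights are invented for illustration, and this is not any provider's actual implementation:

```python
# Toy top-2 mixture-of-experts routing: a gate scores all experts per
# token, and only the two highest-scoring expert subnetworks run.
# Illustrative only; dimensions and weights are made up.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model = 8, 16
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray, top_k: int = 2) -> np.ndarray:
    logits = x @ gate_w                # gate scores for each expert
    top = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the selected experts
    # Only top_k of the n_experts subnetworks do any computation:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,)
```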
Less than 10 billion parameters
Our fastest and best-value models, though they may be less precise.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Meta-Llama-3-8B-Instruct | 8k | $0.03 | $0.06
Mistral-7B-Instruct-v0.3 | 32k | $0.028 | $0.054
Meta-Llama-3.1-8B-Instruct | 128k | $0.03 | $0.05
gemma-3-4b-it | 128k | $0.02 | $0.04
Between 10 and 70 billion parameters
Models that are fine-tuned for a balance between speed and precision.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
MythoMax-L2-13b | 4k | $0.065 | $0.065
gemma-3-27b-it | 128k | $0.10 | $0.20
gemma-3-12b-it | 128k | $0.05 | $0.10
More than 70 billion parameters
Our most capable models, able to handle complex tasks, but also our most expensive and potentially slower to respond.
Model | Context | $ per 1M input tokens | $ per 1M output tokens
---|---|---|---
Meta-Llama-3-70B-Instruct | 8k | $0.30 | $0.40
Meta-Llama-3.1-70B-Instruct | 128k | $0.23 | $0.40
You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price.
- Dedicated A100-80GB, H100-80GB & H200-141GB GPUs for your custom LLM needs
- Billed at minute granularity
- Invoiced weekly
GPU | Price
---|---
Nvidia A100 GPU | $1.50/GPU-hour
Nvidia H100 GPU | $2.40/GPU-hour
Nvidia H200 GPU | $3.00/GPU-hour
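Because custom deployments are billed at minute granularity, a deployment's charge is the GPU-hour rate prorated by minutes of uptime. A minimal sketch, assuming straightforward proration and using the H100 rate from the table:

```python
# Minute-granularity GPU billing: charge = GPU count * hourly rate,
# prorated by minutes of uptime. $2.40/GPU-hour is the H100 price
# from the table above; the proration formula is an assumption.
H100_PER_GPU_HOUR = 2.40

def deployment_cost(num_gpus: int, uptime_minutes: int,
                    rate_per_gpu_hour: float = H100_PER_GPU_HOUR) -> float:
    return num_gpus * (uptime_minutes / 60) * rate_per_gpu_hour

# Two H100s up for 90 minutes:
print(f"${deployment_cost(2, 90):.2f}")  # $7.20
```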
For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com
Embedding models are billed on input tokens only.
Model | Context | $ per 1M input tokens
---|---|---
bge-base-en-v1.5 | 512 | $0.005 |
bge-en-icl | 8k | $0.01 |
bge-large-en-v1.5 | 512 | $0.01 |
bge-m3 | 8k | $0.01 |
bge-m3-multi | 8k | $0.01 |
gte-base | 512 | $0.005 |
gte-large | 512 | $0.01 |
e5-base-v2 | 512 | $0.005 |
e5-large-v2 | 512 | $0.01 |
multilingual-e5-large | 512 | $0.01 |
all-MiniLM-L12-v2 | 512 | $0.005 |
all-MiniLM-L6-v2 | 512 | $0.005 |
all-mpnet-base-v2 | 512 | $0.005 |
multi-qa-mpnet-base-dot-v1 | 512 | $0.005 |
paraphrase-MiniLM-L6-v2 | 512 | $0.005 |
text2vec-base-chinese | 512 | $0.005 |
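For completeness, a sketch of an embedding request through the same assumed OpenAI-compatible endpoint used earlier; the model identifier `BAAI/bge-base-en-v1.5` is an assumption for the bge-base-en-v1.5 row:

```python
# Embedding request sketch: embedding models have no output-token rate,
# so only prompt tokens are billed. Endpoint and model ID are
# assumptions, as before.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY",
                base_url="https://api.deepinfra.com/v1/openai")  # assumed

resp = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",  # assumed model ID
    input="Pricing pages deserve embeddings too.",
)
vector = resp.data[0].embedding
# At $0.005 per 1M input tokens, this request costs:
print(len(vector), resp.usage.prompt_tokens / 1_000_000 * 0.005)
```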
All models run on H100 or A100 GPUs, optimized for inference performance and low latency.
Our system automatically scales the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
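Given the 200-concurrent-request account cap, a client should throttle itself rather than rely on the server to reject overflow. A minimal asyncio sketch; `call_model` is a hypothetical stand-in for your actual request coroutine:

```python
# Client-side throttle for the 200-concurrent-request account limit.
# `call_model` is a hypothetical placeholder; only the semaphore
# pattern matters here.
import asyncio

MAX_CONCURRENT = 200
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def call_model(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real API call
    return f"response to: {prompt}"

async def throttled_call(prompt: str) -> str:
    async with semaphore:  # never more than 200 requests in flight
        return await call_model(prompt)

async def main():
    prompts = [f"prompt {i}" for i in range(1000)]
    results = await asyncio.gather(*(throttled_call(p) for p in prompts))
    print(len(results))  # 1000

asyncio.run(main())
```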
You must add a card or pre-pay before you can use our services. An invoice is always generated at the beginning of the month, and also during the month whenever you hit your tier's invoicing threshold. You can also set a spending limit to avoid surprises.
Every user is part of a usage tier. As your usage and spending go up, we automatically move you to the next usage tier. Every tier has an invoicing threshold; once it is reached, an invoice is automatically generated.
Tier | Qualification | Invoicing threshold
---|---|---
Tier 1 | None (starting tier) | $20
Tier 2 | $100 paid | $100
Tier 3 | $500 paid | $500
Tier 4 | $2,000 paid | $2,000
Tier 5 | $10,000 paid | $10,000
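Read as code, the tier logic is a lookup: cumulative payments determine the tier, and the tier sets the threshold at which a mid-month invoice fires. A sketch under that reading of the table (the exact qualification rules are an assumption):

```python
# Usage-tier lookup, encoding one plausible reading of the table:
# cumulative amount paid determines the tier, and each tier sets
# the invoicing threshold.
TIERS = [              # (minimum paid to qualify, invoicing threshold)
    (10_000, 10_000),  # Tier 5
    (2_000, 2_000),    # Tier 4
    (500, 500),        # Tier 3
    (100, 100),        # Tier 2
    (0, 20),           # Tier 1 (starting tier)
]

def invoicing_threshold(total_paid: float) -> float:
    for min_paid, threshold in TIERS:
        if total_paid >= min_paid:
            return threshold
    return 20  # unreachable; Tier 1 catches everything

print(invoicing_threshold(0))    # 20  -> new account, Tier 1
print(invoicing_threshold(650))  # 500 -> Tier 3
```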