
Real-time inference on DeepInfra is fast — but when a popular model is under heavy load, requests queue up and some get shed with an HTTP 429. The new Priority service tier lets your latency-critical traffic jump to the front of that queue and stay admitted through contention, for 1.5× the real-time price. It's a single OpenAI-compatible field on the request — no separate endpoint, no new API to learn.
Most traffic is happy to retry a 429 and move on. Some isn't. Priority is built for the workloads where waiting in line is the problem:
You opt in per request by setting service_tier to "priority". Leave it off and your request runs at the standard real-time rate, exactly as it does today — nothing changes for traffic that doesn't need to skip the line.
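As a minimal sketch of per-request opt-in, the helper below builds a chat request body and adds service_tier only for calls flagged as latency-critical. The function name build_chat_payload and the latency_critical flag are hypothetical, chosen for illustration; only the service_tier field and its "priority" value come from the text above.

```python
from typing import Any


def build_chat_payload(
    model: str,
    messages: list[dict[str, str]],
    latency_critical: bool = False,
) -> dict[str, Any]:
    """Build a chat-completions request body (hypothetical helper).

    service_tier is added only when the caller opts in, so ordinary
    traffic is sent exactly as it was before.
    """
    payload: dict[str, Any] = {"model": model, "messages": messages}
    if latency_critical:
        payload["service_tier"] = "priority"
    return payload


messages = [{"role": "user", "content": "Summarize this support ticket: ..."}]
standard = build_chat_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", messages)
urgent = build_chat_payload(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", messages, latency_critical=True
)
print("service_tier" in standard)
print(urgent["service_tier"])
```

Keeping the decision in one place like this makes it easy to audit which call sites pay the priority rate.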
One field. Add "service_tier": "priority" to any chat or completions request. Priority requests:
The response's service_tier field comes back "priority" when (and only when) priority was actually applied.
When the model is idle, priority and normal requests look the same: there's no queue to jump. The difference shows up exactly when it matters: under load.
/v1/chat/completions
/v1/completions
Priority-rated billing also applies to embeddings. Everything works through the standard OpenAI-compatible API you're already using.
Priority is billed at 1.5× the corresponding real-time price — applied automatically, with no extra configuration.
Here's the part worth reading twice: you only pay the priority rate when priority is actually delivered. If you request priority on a model that doesn't support it, the request is served normally, billed at the normal rate, and the response's service_tier comes back "default". What's billed always equals what's echoed — so you can verify exactly what you paid for by reading the service_tier field on the response. No silent upcharge for a tier you didn't get.
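The billing rule above can be sketched as a small calculation: the 1.5× multiplier applies only when the echoed tier is "priority". The function name expected_charge and the sample cost are made up for illustration; actual prices depend on the model and are not stated here.

```python
from decimal import Decimal

PRIORITY_MULTIPLIER = Decimal("1.5")  # from the text: priority is 1.5x real-time


def expected_charge(real_time_cost: Decimal, echoed_tier: str) -> Decimal:
    """Estimate one request's charge from the echoed service_tier.

    real_time_cost is what the request would cost at the standard rate
    (a hypothetical input here; real prices depend on the model).
    """
    if echoed_tier == "priority":
        return real_time_cost * PRIORITY_MULTIPLIER
    # Any other echoed tier, e.g. "default", is billed at the normal rate.
    return real_time_cost


base = Decimal("0.0020")  # made-up cost in USD, for illustration only
print(expected_charge(base, "priority"))
print(expected_charge(base, "default"))
```

Logging the echoed tier next to each request gives you what you need to reconcile an invoice this way.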
Priority is live now on models served on our vLLM stack, which covers the bulk of our text-generation catalog. Support for models served on SGLang and TensorRT-LLM is rolling out.
Priority is enabled on a per-model basis, and the set of priority-enabled models is growing continuously. You don't have to guess which ones qualify: every model that supports priority already carries a Priority tag on its model page, so you can see at a glance whether a model honors the tier before you send a request.
And you can always confirm it programmatically: send a request with service_tier="priority" and check the echoed service_tier on the response. If it comes back "priority", you're at the front of the queue; if it comes back "default", the model isn't priority-enabled yet and you weren't charged the premium.
Using the OpenAI Python client — just add service_tier="priority" and read it back off the response:
import os

from openai import OpenAI

# Read the API token from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    service_tier="priority",
)

print(resp.choices[0].message.content)
print("served as:", resp.service_tier)  # "priority" when priority was applied
Or with curl — service_tier is just another field in the JSON body:
curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize this support ticket: ..."}],
    "service_tier": "priority"
  }'
The service_tier field on the response tells you which tier actually served the request.
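If you work with the raw JSON body, for example the output of a curl call, reading the echoed tier might look like the sketch below. The helper name served_tier is hypothetical, the sample body is a trimmed, hand-written stand-in rather than real API output, and treating a missing or null field as "default" is an assumption of this sketch.

```python
import json


def served_tier(response_body: str) -> str:
    """Return the service_tier echoed in a chat-completions response body.

    Falls back to "default" when the field is missing or null
    (an assumption of this sketch, not documented behavior).
    """
    data = json.loads(response_body)
    return data.get("service_tier") or "default"


# Trimmed, hand-written response body for illustration; not real API output.
sample = '{"id": "example", "choices": [], "service_tier": "priority"}'
print(served_tier(sample))
```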
See the Service Tier documentation for the full reference, and start sending your latency-critical traffic to the front of the queue.
© 2026 DeepInfra. All rights reserved.