Introducing the Priority Service Tier: Front-of-Queue Inference When It Counts
Published on 2026.06.29 by DeepInfra

Real-time inference on DeepInfra is fast — but when a popular model is under heavy load, requests queue up and some get shed with an HTTP 429. The new Priority service tier lets your latency-critical traffic jump to the front of that queue and stay admitted through contention, for 1.5× the real-time price. It's a single OpenAI-compatible field on the request — no separate endpoint, no new API to learn.

Why Priority?

Most traffic is happy to retry a 429 and move on. Some isn't. Priority is built for the workloads where waiting in line is the problem:

  • Interactive, user-facing apps where time-to-first-token is the product — chat, autocomplete, voice.
  • Agentic and multi-step pipelines where every hop's latency compounds into a slow end-to-end run.
  • Revenue-critical traffic you can't afford to have shed during peaks.

You opt in per request by setting service_tier to "priority". Leave it off and your request runs at the standard real-time rate, exactly as it does today — nothing changes for traffic that doesn't need to skip the line.

How It Works

One field. Add "service_tier": "priority" to any chat or completions request. Priority requests:

  • Jump to the front of the engine's scheduling queue — lower time-to-first-token when the model is busy.
  • Get protected admission — they keep being accepted while normal-tier traffic begins to see 429s under contention.
  • Echo the tier back — the response's service_tier field comes back "priority" when (and only when) priority was actually applied.

When the model is idle, priority and normal requests look the same — there's no queue to jump. The difference shows up exactly when it matters: under load.
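
Because the tier is chosen per request, one way to route traffic is a small helper that attaches service_tier only to latency-critical calls. This is a minimal sketch: build_chat_request and its latency_critical flag are hypothetical names for illustration, not part of the DeepInfra API.

```python
from typing import Any


def build_chat_request(
    model: str,
    messages: list[dict[str, str]],
    latency_critical: bool = False,
) -> dict[str, Any]:
    """Build keyword arguments for a chat completions call.

    Hypothetical helper: adds service_tier="priority" only when the caller
    marks the request as latency-critical. Otherwise the field is omitted
    and the request runs at the standard real-time rate.
    """
    kwargs: dict[str, Any] = {"model": model, "messages": messages}
    if latency_critical:
        kwargs["service_tier"] = "priority"
    return kwargs


# A user-facing call opts in; a background call leaves the field off.
chat_kwargs = build_chat_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello"}],
    latency_critical=True,
)
background_kwargs = build_chat_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
```

With the OpenAI Python client, the result can be unpacked into the call, for example client.chat.completions.create(**chat_kwargs).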

Supported Endpoints

  • /v1/chat/completions
  • /v1/completions

Priority-rated billing also applies to embeddings. Everything works through the standard OpenAI-compatible API you're already using.

Pricing

Priority is billed at 1.5× the corresponding real-time price — applied automatically, with no extra configuration.

Here's the part worth reading twice: you only pay the priority rate when priority is actually delivered. If you request priority on a model that doesn't support it, the request is served normally, billed at the normal rate, and the response's service_tier comes back "default". What's billed always equals what's echoed — so you can verify exactly what you paid for by reading the service_tier field on the response. No silent upcharge for a tier you didn't get.
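
The billing rule can be written as a small calculation in which the multiplier follows the echoed tier, not the requested one. This is a sketch under stated assumptions: expected_cost is a hypothetical function name, and base_cost is an illustrative figure standing in for whatever the request would cost at the real-time rate.

```python
PRIORITY_MULTIPLIER = 1.5  # priority is billed at 1.5x the real-time price


def expected_cost(base_cost: float, echoed_tier: str) -> float:
    """Return the expected charge for one request.

    base_cost is the cost at the real-time rate (illustrative input).
    echoed_tier is the service_tier value read off the response.
    """
    if echoed_tier == "priority":
        return base_cost * PRIORITY_MULTIPLIER
    # Any other echoed value means the request was served at the normal rate.
    return base_cost


print(expected_cost(0.50, "priority"))  # 0.75
print(expected_cost(0.50, "default"))   # 0.5
```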

Model Support Today

Priority is live now on models served on our vLLM stack, which covers the bulk of our text-generation catalog. Support for models served on SGLang and TensorRT-LLM is rolling out.

Priority is enabled on a per-model basis, and the set of priority-enabled models is growing continuously. You don't have to guess which ones qualify: every model that supports priority already carries a Priority tag on its model page, so you can see at a glance whether a model honors the tier before you send a request.

And you can always confirm it programmatically: send a request with service_tier="priority" and check the echoed service_tier on the response. If it comes back "priority", you're at the front of the queue; if it comes back "default", the model isn't priority-enabled yet and you weren't charged the premium.
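
That check can be wrapped in a small helper. This is a minimal sketch: priority_applied is a hypothetical name, and the SimpleNamespace objects are stand-ins for real responses so the example runs without a network call.

```python
from types import SimpleNamespace


def priority_applied(response) -> bool:
    """Return True if the response's echoed service_tier is "priority".

    Works with any object exposing a service_tier attribute, such as the
    response returned by the OpenAI Python client. A missing attribute is
    treated as "not priority".
    """
    return getattr(response, "service_tier", None) == "priority"


# Stand-in responses (assumption: real responses carry the same attribute).
served_priority = SimpleNamespace(service_tier="priority")
served_default = SimpleNamespace(service_tier="default")

print(priority_applied(served_priority))  # True
print(priority_applied(served_default))   # False
```

Logging the result per model is one way to notice when a model you depend on becomes priority-enabled.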

Get Started

Using the OpenAI Python client — just add service_tier="priority" and read it back off the response:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],  # read the token from the environment
    base_url="https://api.deepinfra.com/v1/openai",
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    service_tier="priority",
)

print(resp.choices[0].message.content)
print("served as:", resp.service_tier)  # "priority" when priority was applied

Or with curl — service_tier is just another field in the JSON body:

curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize this support ticket: ..."}],
    "service_tier": "priority"
  }'

The service_tier field on the response tells you which tier actually served the request.

Skip the Line

See the Service Tier documentation for the full reference, and start sending your latency-critical traffic to the front of the queue.
