
Many users have requested longer-context models to help them summarize larger chunks of text or write novels with ease.
We're proud to announce our selection of long-context models, which will grow in the coming weeks.
Mistral-based models have a context size of 32k tokens, and Amazon recently released a model fine-tuned specifically on longer contexts.
We also recently released the highly praised Yi models. Keep in mind that they don't support chat, only old-school text completion (new models are in the works).
© 2026 Deep Infra. All rights reserved.