
Long Context models incoming
Published on 2023.11.21 by Iskren Chernev

Many users have requested longer-context models to help them summarize larger chunks of text or write novels with ease.

We're proud to announce our long-context model selection, which will keep growing in the coming weeks.

Models

Mistral-based models have a context size of 32k, and Amazon recently released MistralLite, a model fine-tuned specifically for longer contexts.

We also recently released the highly praised Yi models. Keep in mind they don't support chat, only old-school text completion (new chat-capable models are in the works).

Context FAQ

  • What does context mean? - Context is the number of tokens the model can attend to at once. In practice this limits the sum of the tokens you send plus the max new tokens (2k by default) that can be generated. There is no 1:1 correspondence between tokens and words; a general rule of thumb is that 100 tokens is roughly 75 words (see the estimation sketch after this list).
  • Can I go above the context? - Technically -- no. Any API to a 4k model that lets you pass more will one way or another truncate the tokens given to the LLM to its context size. You can also shorten a long input yourself (by removing text from the middle, for example) and re-submit (see the truncation sketch below).
  • My input fits the context size, but the model doesn't take it into account - The context size of a model is a hard limit fixed when the model is created. But just because a model is listed with a certain context size doesn't mean you can cram it with information and expect excellent results. Models differ in many parameters, and one such parameter is how far back they can recall information. Check the MistralLite page on Hugging Face for some metrics shared by Amazon.
  • What can I do to make the model take into account more of the data I sent - There is no single answer; it varies by model, and each model's capacity varies. The best you can do is test a particular model on a single task (like summarization or question answering) with different context lengths and find what works for you. Also try placing the system prompt at the start and/or the end of the input. For the most control, use the text-completion endpoint rather than the chat one (see the probing sketch below).
  • Is a longer-context model guaranteed to understand more context than a short-context model? - Unfortunately -- no. For example, a weak model with a huge context size may fail to act on a single sentence, whereas a good model with a shorter context could comprehend a few paragraphs or even pages.
  • I can't find a good model for my large context needs! - Don't lose hope! This is a rapidly evolving ecosystem with models being released every day. We try our best to bring the best open-source models to you as quickly as possible. Let us know which models you like by emailing feedback@deepinfra.com or drop us a message on Discord.
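To make the 100-tokens-per-75-words rule of thumb concrete, here is a minimal sketch. The word-split heuristic is only an approximation for illustration; exact counts depend on the model's own tokenizer (e.g. loaded via Hugging Face's transformers.AutoTokenizer).

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~100 tokens per 75 words (4/3 tokens per word).
    For exact counts, use the tokenizer that ships with the specific model."""
    words = len(text.split())
    return round(words * 100 / 75)

prompt = "Summarize the following report in three bullet points. " * 50
print(estimate_tokens(prompt))  # ~533 tokens for this 400-word prompt
```

An estimate like this is enough to decide whether your input plus the max new tokens (2k by default) fits under a model's context size before you send the request.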
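Here is a minimal sketch of the middle-truncation idea from the FAQ: keep the head and tail of an over-long input and drop the middle. It budgets in characters for simplicity; a real version should budget in tokens. The function name and marker string are illustrative, not part of any DeepInfra API.

```python
def truncate_middle(text: str, max_chars: int, marker: str = "\n[...]\n") -> str:
    """Keep the start and end of `text`, dropping the middle so the
    result fits in `max_chars`. Character-based for simplicity."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    head = keep // 2          # characters kept from the start
    tail = keep - head        # characters kept from the end
    return text[:head] + marker + text[-tail:]

doc = "A" * 10_000
short = truncate_middle(doc, 4_000)
print(len(short))  # 4000
```

Keeping both ends often works well because instructions tend to sit at the start and the most recent material at the end, but test it against your own task.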
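And a sketch of the probing approach from the FAQ: plant a fact early in the prompt, pad with filler, and see at what length the model stops recalling it. This assumes DeepInfra's OpenAI-compatible completions endpoint and the openai Python client; the model name, filler text, and lengths are just examples.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="<YOUR_DEEPINFRA_API_KEY>",
)

FACT = "The secret code is 4711."
FILLER = "The weather report was uneventful that day. "

for n_filler in (10, 200, 2000):
    # The fact sits at the very start; the question at the very end.
    prompt = FACT + " " + FILLER * n_filler + "\nWhat is the secret code?"
    resp = client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1",  # example model
        prompt=prompt,
        max_tokens=16,
    )
    print(n_filler, "->", resp.choices[0].text.strip())
```

Using the text-completion endpoint (rather than chat) keeps the prompt exactly as you wrote it, so the distance between the fact and the question is fully under your control.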