MiniMax M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

MiniMax-M2.5

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

NVIDIA Nemotron 3 Nano is an open reasoning model optimized for fast, cost-efficient inference. Built with a hybrid MoE and Mamba architecture and trained on NVIDIA-curated synthetic reasoning data, it delivers strong multi-step reasoning with stable latency and predictable performance for agentic and production workloads.

Nemotron-3-Nano-30B-A3B

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

For a limited time, FIBO Edit is free on DeepInfra. Your AI, your rules: FIBO Edit is an open-source image editing model for production-grade visual generation, with native masking and a lightweight 8B architecture.

fibo_edit

Remove unwanted objects or regions from video using a mask and reconstruct the background with intelligent content-aware fill.

video_eraser

Automatically identify and segment foreground objects across video frames and generate a mask. No prompts, just a video.

video_foreground_mask

Increase video resolution up to 8K with advanced AI upscaling. Bring your videos to the big screen, ready for the screens of tomorrow.

video_increase_resolution

Identify and segment objects across video frames using specific coordinate points. Just point in the right direction and the model will figure out by itself which object should be masked.

video_mask_by_key_points

Identify and segment objects across video frames using a text prompt. The easiest way to create a mask to modify your videos.

video_mask_by_prompt

Light and fast. Remove the background of your videos to bring the foreground elements to focus. No more unwanted distractions.

video_remove_background

P-Image is a state-of-the-art real-time generation model with exceptional text rendering, fine-detail accuracy, and rock-solid prompt adherence. It is built for instant creativity, producing high-fidelity images in about one second at a fraction of typical model costs.

p-image

P-Image-Edit is a high-precision image editing model that applies complex transformations, insertions, removals, and style adjustments in under a second. It delivers state-of-the-art accuracy, clean boundaries, and reliable prompt alignment, making multi-step edits fast, consistent, and production-ready.

p-image-Edit

HiggsAudioV2.5 is a high-quality neural text-to-speech (TTS) model designed for natural-sounding voice generation across a wide range of use cases. It focuses on clarity, stable prosody, and consistent pacing, making it suitable for both short prompts and longer narration.

HiggsAudioV2.5

Chatterbox is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI. We are excited to introduce Chatterbox-Turbo, our most efficient model yet. Built on a streamlined 350M parameter architecture, Turbo delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just one, while retaining high-fidelity audio output. Paralinguistic tags are now native to the Turbo model, allowing you to use [cough], [laugh], [chuckle], and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it excels at narration and creative workflows. If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service.
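The bracketed tag syntax mentioned above can be illustrated with a small text-preprocessing sketch. This is not the Chatterbox API: `find_tags` and `strip_tags` are hypothetical helpers, and the assumption that tags are lowercase words in square brackets comes only from the three examples in the description.

```python
import re

# Assumption: a paralinguistic tag is a lowercase word in square brackets,
# matching the [cough], [laugh], [chuckle] examples in the description.
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")

def find_tags(text):
    """Return the tag names found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)

def strip_tags(text):
    """Return the text with tags removed and whitespace collapsed."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()

script = "Sorry about that [cough] let me try again. [laugh] There we go."
tags = find_tags(script)
plain = strip_tags(script)
print(tags)
print(plain)
```

A helper like this could be used to check a script for unsupported tags before sending it to a TTS endpoint, or to produce a plain-text transcript alongside the audio.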

chatterbox-turbo

The fastest model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

FLUX-2-klein-4b

The Flux 2 family model with the best quality-to-latency ratio, aimed at production apps. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-9b

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding. The latest Claude 4 generation delivers breakthrough capabilities in analytical thinking, with Claude 4 Opus setting new standards for intelligence and Claude 4 Sonnet providing exceptional performance with remarkable efficiency.

Claude models excel at understanding context, following complex instructions, and maintaining coherent conversations across extended interactions. With advanced features like extended thinking for deeper reasoning, prompt caching that reduces costs by up to 90%, vision capabilities for image analysis, and robust safety measures, Claude is designed for enterprise applications that demand both sophistication and reliability.

Available with comprehensive API features including streaming responses, batch processing for 50% cost savings, multilingual support across dozens of languages, and flexible context windows up to 200K tokens (1M in beta), Claude is perfect for building intelligent applications like customer support agents, content analysis systems, coding assistants, and complex reasoning workflows that require both accuracy and trustworthiness.

Claude AI family: Claude 4 Opus for complex reasoning, Claude 4 Sonnet for balanced performance, plus advanced capabilities like extended thinking, prompt caching, vision analysis, and enterprise-grade safety APIs.

Claude AI APIs via DeepInfra

claude

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding

DeepInfra provides access to Anthropic's latest Claude models, featuring the most advanced reasoning capabilities and balanced performance options, all with enterprise-grade safety and reliability.

Claude

DeepSeek develops advanced foundation models optimized for computational efficiency and strong generalization across diverse tasks. The architecture incorporates recent advances in transformer-based systems, delivering robust performance in both zero-shot and fine-tuned scenarios. Models are pretrained on rigorously filtered multilingual corpora with specialized optimizations for mathematical reasoning and algorithmic tasks. The inference stack achieves competitive throughput while maintaining low latency, making it suitable for production deployment. Researchers and engineers can leverage these models for tasks ranging from natural language processing to complex analytical problem-solving.

deepseek

DeepSeek's models are a suite of advanced AI systems that prioritize efficiency, scalability, and real-world applicability.

DeepSeek

Developed by Black Forest Labs (founded by the original creators of Stable Diffusion), Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism. Built on an advanced 12-billion-parameter architecture, Flux models excel at understanding exactly what you want to create or modify.

The revolutionary FLUX.1 Kontext introduces game-changing image editing capabilities—simply describe what you want to change in an existing image, and it makes precise modifications while keeping everything else intact. Character faces, lighting, and composition remain consistent across multiple edits, enabling truly iterative creative workflows.

Flux offers specialized variants for every need: Pro delivers maximum quality, Dev provides open-weight flexibility for research, Schnell generates images in just 1-4 steps for rapid iteration, plus dedicated editing tools for specific tasks. Available from open-source to commercial licensing, Flux is perfect for developers building creative applications, product visualization tools, and next-generation image editing experiences.


Flux AI image generation family: FLUX.1 Kontext for in-context editing, FLUX.1 Pro/Dev for text-to-image synthesis, plus comprehensive editing tools and state-of-the-art visual generation APIs.

Flux Image Generation APIs via DeepInfra

flux

Developed by Black Forest Labs, Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism.

DeepInfra provides access to Black Forest Labs' complete Flux ecosystem, offering everything from lightning-fast generation to sophisticated in-context editing capabilities with industry-leading prompt adherence and visual quality.

Flux

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities, designed for advanced reasoning, complex problem-solving, and comprehensive understanding across text, audio, video, and images. Built with revolutionary thinking architecture, Gemini models reason through problems step-by-step before responding, delivering enhanced accuracy and performance for sophisticated applications.

Gemini 2.5 Pro sets new standards for complex reasoning and coding excellence, while Gemini 2.5 Flash provides optimal price-performance for high-volume tasks. With massive context windows up to 1 million tokens, native multimodal processing that handles hours of video and audio, and transparent reasoning capabilities that show step-by-step thinking processes, Gemini excels at document analysis, code generation, scientific research, and agentic workflows.

Perfect for building intelligent applications that require deep reasoning, multimodal understanding, long-context processing, and transparent AI decision-making with Google's enterprise-grade reliability and performance.


Gemini AI family: Advanced thinking models with native multimodal processing for text, audio, video, and image understanding APIs

Gemini AI Model APIs via DeepInfra

gemini

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities

DeepInfra provides access to Google's latest Gemini models, featuring advanced thinking capabilities, native multimodal processing, and industry-leading performance for complex reasoning and development tasks.

Gemini

Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest versions feature Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by dynamically activating subsets of parameters. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens), Llama excels in tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference, ensuring low-latency, high-throughput performance for production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama is ideal for developers seeking powerful, customizable AI with minimal overhead.

llama

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences.

Llama 4

Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

Llama 3

Llama

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance. These models excel at complex reasoning, mathematics, coding, and specialized tasks while offering complete transparency and deployment freedom through open-source licensing.

Mistral Small 3.2 delivers breakthrough efficiency with native fluency in European languages, while specialized variants handle specific needs: Devstral for coding, Voxtral for audio processing, and Mixtral for high-performance tasks. With Apache 2.0 licensing, extensive context windows up to 128K tokens, and comprehensive customization options, Mistral provides enterprise-grade capabilities without vendor lock-in.

Perfect for building multilingual applications, coding assistants, and reasoning systems where you need both powerful performance and complete control over your AI deployment.

Mistral AI model family: Mistral Small 3.2 for efficient performance, Mixtral for specialized tasks, Devstral for coding, plus multilingual reasoning, mathematics, and open-source flexibility APIs.

Mistral AI Model APIs via DeepInfra

mistral

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance

DeepInfra provides access to Mistral AI's comprehensive open-source model ecosystem, from efficient small models to specialized coding and audio processing variants, all with complete Apache 2.0 licensing freedom.

Mistral

Voxtral is a family of audio models with state-of-the-art speech to text capabilities.

Voxtral

The Nemotron family is a group of large language models developed by NVIDIA, specifically engineered to excel at generating high-quality synthetic data for training other, more powerful AI models. Unlike models focused solely on end-user chat or content creation, Nemotron's core strength lies in producing diverse and realistic text-based training examples—including question-answer pairs, instructions, and conversations—that are crucial for the "supervised fine-tuning" stage of AI development. By providing a robust toolkit for creating these datasets, Nemotron acts as a powerful "force multiplier" in the AI training pipeline, enabling developers to build more capable and refined specialized models efficiently and at scale, without relying solely on scarce, human-curated data.

nemotron

NVIDIA Nemotron is a family of open models customized for efficiency, accuracy, and specialized workloads.

Nemotron

Developed by Alibaba Group's Qwen Team, Qwen is a family of state-of-the-art large language and multimodal models designed for comprehensive AI capabilities and multilingual performance. The latest Qwen3 generation features balanced model architectures including reintroduced Mixture-of-Experts (MoE) variants (Qwen3-30B-A3B and Qwen3-235B-A22B) alongside dense models up to 32B parameters, enabling efficient resource utilization through dynamic parameter activation. 

With support for 119 languages and dialects, hybrid thinking modes that seamlessly alternate between reasoning and instruction-following without model switching, and extended context windows (up to 1M tokens in Qwen3-2507), Qwen excels in tasks like multilingual understanding, code generation, agentic workflows, and complex problem-solving. The models utilize advanced Byte-level Byte Pair Encoding with a 151,646-token vocabulary, structured ChatML formatting for conversational interactions, and robust tool calling capabilities with parallel execution support. 

Available in both proprietary and open-weight versions with flexible licensing, comprehensive model variants (Base, Instruct, Thinking, and hybrid modes), and enhanced Model Context Protocol support, Qwen is ideal for developers seeking powerful, multilingual AI systems with sophisticated reasoning capabilities and minimal deployment complexity.

Qwen AI model family: Qwen3 language models, specialized coding & reasoning models, plus state-of-the-art embedding & reranking APIs for search and RAG applications.

Qwen Model APIs via DeepInfra

qwen

Qwen series offers a comprehensive suite of dense and mixture-of-experts models.

DeepInfra provides access to Qwen's latest generation of large language models, offering both specialized coding models and general-purpose AI systems with advanced reasoning capabilities.

Qwen

The most widely used version of Stable Diffusion. Trained on 512x512 images, it can generate realistic images from a text description.

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. It has achieved strong performance on various benchmarks, scoring 70.3% on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark.

We introduce StarCoder2-15B-Instruct-v0.1, the very first entirely self-aligned code Large Language Model (LLM) trained with a fully permissive and transparent pipeline. Our open-source pipeline uses StarCoder2-15B to generate thousands of instruction-response pairs, which are then used to fine-tune StarCoder2-15B itself without any human annotations or distilled data from huge and proprietary LLMs.

Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k.  This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. Kimi K2 is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. It excels across coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) benchmarks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).

We present a sentence transformation model that generates semantically similar sentences. Our model is based on the Sentence-Transformers architecture and was trained on a large dataset of sentence pairs. We evaluate the effectiveness of our model by measuring its ability to generate similar sentences that are close to the original sentence in meaning.

DeepSeek-Prover-V2 is an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning.

The Qwen3 Embedding model series is the latest model series in the Qwen family, designed specifically for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B).

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.
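To make the total versus active parameter figures above concrete, the following sketch computes the share of parameters that are active for each model. It uses only the numbers quoted in the description; the helper name `active_fraction` is illustrative.

```python
def active_fraction(total_billion, active_billion):
    """Share of a model's parameters that are active, as a value in [0, 1]."""
    return active_billion / total_billion

# (total, active) parameter counts in billions, as quoted in the description.
models = {
    "GLM-4.5": (355, 32),
    "GLM-4.5-Air": (106, 12),
}

fractions = {name: active_fraction(t, a) for name, (t, a) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active")
# GLM-4.5: 9.0% of parameters active
# GLM-4.5-Air: 11.3% of parameters active
```

In both cases roughly a tenth of the weights take part in a given forward pass, which is why compute cost tracks the active count while memory requirements still follow the total count.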

ClarityAI/flux integrates the Flux AI model into the upscaling process, enabling high-resolution enhancements with superior face preservation and support for LoRAs to apply specific styles or identities.

OpenChat 3.6 is a Llama-3-8B fine-tune that outperforms the base model on multiple benchmarks.

Qwen2.5 is a model pretrained on a large-scale dataset of up to 18 trillion tokens, offering significant improvements in knowledge, coding, mathematics, and instruction following compared to its predecessor Qwen2. The model also features enhanced capabilities in generating long texts, understanding structured data, and generating structured outputs, while supporting multilingual capabilities for over 29 languages.

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

QwQ is an experimental research model developed by the Qwen Team, designed to advance AI reasoning capabilities. This model embodies the spirit of philosophical inquiry, approaching problems with genuine wonder and doubt. QwQ demonstrates impressive analytical abilities, achieving scores of 65.2% on GPQA, 50.0% on AIME, 90.6% on MATH-500, and 50.0% on LiveCodeBench. It combines a contemplative approach with strong performance on complex problems.

Gemma is a family of lightweight, state-of-the-art open models from Google. The 9B Gemma 2 model delivers class-leading performance, outperforming Llama 3 8B and other open models in its size category.

Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model built from a mixture of 8 expert 7B models. During inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.
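The expert-selection step described above can be sketched in a few lines. This is a toy illustration of top-2 routing over 8 experts, not Mixtral's actual implementation: the experts here are simple scaling functions, the gate scores are hard-coded, and the names `top_k_route` and `moe_forward` are hypothetical.

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(gate_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and combine their outputs by weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Eight toy experts: expert i multiplies its input by i + 1.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Assumed gate scores for one input; a real router computes these from the input.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top_k_route(gate_logits)
output = moe_forward(3.0, experts, gate_logits)
print(routes)
print(output)
```

Only two of the eight expert functions are evaluated per call, which is the source of the inference savings: the model stores all experts but pays the compute cost of the selected ones.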

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

DeepSeek-V3-0324 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It is an improved iteration of DeepSeek-V3.

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a model with 17 billion active parameters and 128 experts.

09/04 🔥 Introducing Chatterbox Multilingual in 23 Languages!  We're excited to introduce Chatterbox and Chatterbox Multilingual, Resemble AI's production-grade open source TTS models. Chatterbox Multilingual supports Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

The latest image model, delivering better editing consistency, improved multi-image fusion, finer detail control, natural small text and faces, and harmonious, aesthetic visuals.

Turn any image into a video. Intelligent shot scheduling supports multi-shot storytelling, generating multi-shot narrative videos with consistent subjects, scenes, and atmosphere

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model, successor to Gemma 2

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performances.

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3-12B is Google's latest open source model, successor to Gemma 2

Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word timestamps. This variant includes an implementation that predicts word timestamps and provides a more accurate estimation of speech segments when transcribing with Whisper models.

CodeGemma is a collection of lightweight open code models built on top of Gemma. CodeGemma models are text-to-text and text-to-code decoder-only models and are available as a 7 billion pretrained variant that specializes in code completion and code generation tasks, a 7 billion parameter instruction-tuned variant for code chat and instruction following and a 2 billion parameter pretrained variant for fast code completion.

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps. 

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Devstral is an agentic LLM for software engineering tasks, making it a great choice for software engineering agents.

The latest flagship reasoning model in the Qwen3 family. Further enhanced by multiple innovations like adaptive tool-use and advanced test-time scaling techniques

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries. This model reaches Arena Hard of 85.0, AlpacaEval 2 LC of 57.6 and GPT-4-Turbo MT-Bench of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo.  As of 16th Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

A 12B model trained jointly by Mistral AI and NVIDIA that significantly outperforms existing models of smaller or similar size.

Gemini 2.5 Flash is Google's latest thinking model, designed to tackle increasingly complex problems. It is capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy. Gemini 2.5 Flash is best for balancing reasoning and speed.

Bria 3.2 is the next-generation commercial-ready text-to-image model. With just 4 billion parameters, it provides exceptional aesthetics and text rendering, evaluated to be on par with leading open-source models and to outperform other licensed models.

Gemini 2.5 Pro is Google's most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities. Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Gemini 2.5 Pro model is now available on DeepInfra.

Black Forest Labs' latest state-of-the-art proprietary model, sporting top-of-the-line prompt following, visual quality, details and output diversity.

The SDXL Turbo model, developed by Stability AI, is an optimized, fast text-to-image generative model. It is a distilled version of SDXL 1.0, leveraging Adversarial Diffusion Distillation (ADD) to generate high-quality images in fewer steps.

Multi-reference visual intelligence with unprecedented detail, color precision, and spatial reasoning.  The most advanced image generation and editing model. Generate photorealistic images with precise control.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrated a strong ability to generalise to many datasets and domains without fine-tuning. Whisper checkpoints are available in five configurations of varying model sizes, including a smallest configuration trained on English-only data and a largest configuration trained on multilingual data. This one is English-only.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labeled data; it generalizes to many datasets and domains without fine-tuning. It is a Transformer-based encoder-decoder model, trained on English-only or multilingual data, that predicts transcriptions in the same language as the audio or in a different one. Whisper checkpoints come in five configurations of varying model sizes.

Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word timestamps. This version includes an implementation that predicts word timestamps and provides a more accurate estimation of speech segments when transcribing with Whisper models.

This model is part of the GLM-V family of models, introduced in the paper GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

FLUX.1-dev is a state-of-the-art 12 billion parameter rectified flow transformer developed by Black Forest Labs. This model excels in text-to-image generation, providing highly accurate and detailed outputs. It is particularly well-regarded for its ability to follow complex prompts and generate anatomically accurate images, especially with challenging details like hands and faces.

The 7 billion parameter Qwen2 excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruction fine-tuned version of the Mistral-7B-v0.2 generative text model, trained on a variety of publicly available conversation datasets.

EmbeddingGemma is a 300M parameter multilingual open embedding model from Google DeepMind, designed for efficient deployment even on low-resource devices, producing high-quality text vector representations for tasks such as search, classification, clustering, and semantic similarity.

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It brings significant improvements in code generation, code reasoning, and code fixing, and provides a more comprehensive foundation for real-world applications such as Code Agents. It not only enhances coding capabilities but also maintains its strengths in mathematics and general competencies.

Qwen3-235B-A22B-Thinking-2507 is a new Qwen3 model that scales up the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning.

L3.3-70B-Euryale-v2.3 is a model focused on creative roleplay from Sao10k

Gemma is an open-source model designed by Google. This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality.

Stable Diffusion is a latent text-to-image diffusion model. It generates realistic images from a text description.

Kimi K2 is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks.

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

The latest version of the Airoboros model, a fine-tuned version of llama-2-70b using the Airoboros dataset. This model is currently running jondurbin/airoboros-l2-70b-2.2.1.

The model is an auto-regressive vision language model that uses an optimized transformer architecture. The model enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities.

Veo 3.1 is the latest text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.  Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

The DeepSeek R1 0528 turbo model is a state of the art reasoning model that can generate very quick responses

Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills as well as alignment data for safety and Responsible AI. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. Phi-4-reasoning-plus has been trained additionally with Reinforcement Learning, hence, it has higher accuracy but generates on average 50% more tokens, thus having higher latency.

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B)

Euryale 70B v2.1 is a model focused on creative roleplay from Sao10k

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes

DeepSeek-V3.1 Terminus is an update to DeepSeek V3.1 that maintains the model's original capabilities while addressing issues reported by users, including language consistency and agent capabilities, further optimizing the model's performance in coding and search agents. It is a large hybrid reasoning model (671B parameters, 37B active) that supports both thinking and non-thinking modes. It extends the DeepSeek-V3 base with a two-phase long-context training process. Users can control the reasoning behaviour with the reasoning enabled boolean. Learn more in our docs. The model improves tool use, code generation, and reasoning efficiency, achieving performance comparable to DeepSeek-R1 on difficult benchmarks while responding more quickly. It supports structured tool calling, code agents, and search agents, making it suitable for research, coding, and agentic workflows.

Seedream 4.0 is a SOTA multimodal image creation model built on leading architecture. It breaks through the boundaries of traditional text-to-image models by natively supporting text, single-image, and multi-image inputs. Users can freely combine text and images to achieve diverse creative modes within a single model—such as multi-image blending, image editing, and sequentially batch image generation, featuring subject consistency, making image creation more free and controllable.

Qwen3-235B-A22B-Instruct-2507 is the updated version of the Qwen3-235B-A22B non-thinking mode, featuring significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.  This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Wan2.6 text-to-image: upgraded visual quality, aesthetics, and instruction-following deliver precise style control, realistic portraits, long-text understanding, and broad historical/cultural IP coverage, enabling high-quality, highly expressive visual generation.

Qwen3-Coder-480B-A35B-Instruct is Qwen3's most agentic code model, featuring significant performance on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.

A drop-in replacement for Flux [Dev] that delivers sharper details, richer colors, and enhanced realism, while instantly boosting LoRAs and LyCORIS with full compatibility.

PixVerse's 720p resolution offers a fast and reliable option for generating standard HD videos, ideal for quick previews and social media content where generation speed is prioritized over maximum detail.

This is a 32B reasoning model trained from Qwen2.5-32B-Instruct with 17K data. Its performance is on par with the o1-preview model on both math and coding.

The CLIP model maps text and images to a shared vector space, enabling various applications such as image search, zero-shot image classification, and image clustering. The model can be used easily after installation, and its performance is demonstrated through zero-shot ImageNet validation set accuracy scores. Multilingual versions of the model are also available for 50+ languages.

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves performance comparable to leading open-source models that are 10x larger.

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout, a 17 billion parameter model with 16 experts

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support

LLaMa 2 is a collection of LLMs trained by Meta. This is the 70B chat-optimized version. This endpoint has per-token pricing.

Mistral-7B-Instruct-v0.3 is an instruction-tuned model, the next iteration of Mistral 7B, with a larger vocabulary, a newer tokenizer, and support for function calling.

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

  At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It is capable of generating images ranging between 0.25 and 2 megapixel resolution. 

ByteDance's Seedance 1.5 Pro is a professional video model using V2A native generation for integrated, synced audio-visual output, enhancing efficiency of professional video creation.

You can use cURL or any other HTTP client to run inference:

```bash
curl -X POST \
    -d '{"prompt": "A kitten is yawning at the camera"}'  \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -H 'Content-Type: application/json'  \
    'https://api.deepinfra.com/v1/inference/ByteDance/Seedance-1.5-Pro'
```

which will give you back something similar to:

```json
{
  "video_url": "/model/inference/seedance_sample.mp4",
  "status": "ok",
  "out_tokens": 1000,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}
```
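The same request can be issued from Python. The sketch below uses only the standard library and mirrors the cURL call above. The helper names `build_request` and `run_inference` are illustrative and not part of any DeepInfra SDK, and the 600-second timeout is an assumption rather than a documented limit.

```python
import json
import urllib.request

# Endpoint copied from the cURL example above.
API_URL = "https://api.deepinfra.com/v1/inference/ByteDance/Seedance-1.5-Pro"


def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )


def run_inference(prompt: str, token: str, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response (hypothetical helper).

    The timeout value is an assumption; adjust it to your workload.
    """
    request = build_request(prompt, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a real network call, so it is left commented out):
# import os
# result = run_inference("A kitten is yawning at the camera",
#                        os.environ["DEEPINFRA_TOKEN"])
# print(result["video_url"])
```

Keeping request construction separate from sending makes it possible to inspect the headers and body without spending any inference budget.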


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'ByteDance/Seedance-1.5-Pro'  \
    -i 'prompt=A kitten is yawning at the camera'
```

which will give you back something similar to:

```json
{
  "video_url": "/model/inference/seedance_sample.mp4",
  "status": "ok",
  "out_tokens": 1000,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}
```
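A response shaped like the sample above can be unpacked in a few lines of Python. This is a minimal sketch: `VideoResult` and `parse_response` are hypothetical names, and treating any top-level `status` other than `"ok"` as a failure is an assumption, since the sample only shows the success case.

```python
import json
from dataclasses import dataclass


@dataclass
class VideoResult:
    video_url: str
    runtime_ms: int
    cost: float


def parse_response(raw: str) -> VideoResult:
    """Extract the fields of interest from a response shaped like the sample."""
    data = json.loads(raw)
    # Assumption: a status other than "ok" means the inference did not succeed.
    if data.get("status") != "ok":
        raise RuntimeError(f"inference failed: {data.get('status')!r}")
    # inference_status may be missing or null, so fall back to an empty dict.
    stats = data.get("inference_status") or {}
    return VideoResult(
        video_url=data["video_url"],
        runtime_ms=int(stats.get("runtime_ms", 0)),
        cost=float(stats.get("cost", 0.0)),
    )
```

Note that `video_url` in the sample is a relative path, so the caller still has to resolve it against the appropriate host before downloading.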


- `prompt`
- `first_frame_image`: First frame image for image-to-video generation
- `last_frame_image`: Last frame image for image-to-video generation
- `resolution`
- `aspect_ratio`
- `duration`: Duration of the output video in seconds (4-12, or -1 for model to decide)
- `seed`
- `camera_fixed`
- `watermark`: Whether to add a watermark to the output video
- `generate_audio`: Whether the generated video includes audio synchronized with the visuals

The webhook to call when inference is done. By default, you will get the output in the response of your inference request.
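The input fields listed above can be assembled into a JSON payload before sending. The following is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the Python types chosen for each field (strings for images, resolution, and aspect ratio; an integer seed; booleans for the flags) are guesses not confirmed by this page, and only the `duration` range comes from the field description.

```python
from typing import Any, Dict, Optional


def build_payload(
    prompt: str,
    *,
    first_frame_image: Optional[str] = None,
    last_frame_image: Optional[str] = None,
    resolution: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    duration: Optional[int] = None,
    seed: Optional[int] = None,
    camera_fixed: Optional[bool] = None,
    watermark: Optional[bool] = None,
    generate_audio: Optional[bool] = None,
) -> Dict[str, Any]:
    """Collect the documented input fields, omitting any that are unset."""
    # The only documented constraint: 4-12 seconds, or -1 to let the model decide.
    if duration is not None and duration != -1 and not 4 <= duration <= 12:
        raise ValueError("duration must be between 4 and 12 seconds, or -1")
    candidates = {
        "prompt": prompt,
        "first_frame_image": first_frame_image,
        "last_frame_image": last_frame_image,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "duration": duration,
        "seed": seed,
        "camera_fixed": camera_fixed,
        "watermark": watermark,
        "generate_audio": generate_audio,
    }
    # Drop unset fields so the service can apply its own defaults.
    return {key: value for key, value in candidates.items() if value is not None}
```

For example, `build_payload("A kitten is yawning at the camera", duration=-1, watermark=False)` yields a dictionary with just `prompt`, `duration`, and `watermark`, which can be serialized with `json.dumps` and used as the request body.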

Seedance-1.5-Pro


Precise audio-visual alignment

Multilingual group dialogue

Cinematic narrative quality