GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

GLM-5.1

High-quality multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for natural, expressive speech with <250ms time-to-first-audio.

inworld-tts-1.5-max

Fast multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for low-latency applications with <130ms time-to-first-audio.

inworld-tts-1.5-mini

Step 3.5 Flash is an open-source reasoning model by StepFun with 196B total parameters (11B active) using Mixture of Experts. It features a 256K context window, deep reasoning, tool calling, and agentic capabilities, achieving 97.3 on AIME 2025 and 74.4% on SWE-bench Verified.

Step-3.5-Flash

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. Sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

Qwen3.5-397B-A17B

Qwen3.5-122B-A10B is a large Mixture-of-Experts model from Alibaba's Qwen3.5 series with 122B total parameters and 10B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling, and support for 201 languages. Excels at complex reasoning, coding, multimodal understanding, and agentic tasks with the efficiency of sparse activation.

Qwen3.5-122B-A10B

Qwen3.5-35B-A3B is an efficient Mixture-of-Experts model from Alibaba's Qwen3.5 series with 35B total parameters and only 3B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling, and support for 201 languages. Delivers strong performance on reasoning, coding, and vision-language tasks at a fraction of the compute cost.

Qwen3.5-35B-A3B
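The Qwen3.5 Mixture-of-Experts entries above quote both a total parameter count and the parameters activated per token. As a rough illustration only, the sketch below computes the active share from those quoted figures. The helper name `active_fraction` is hypothetical, and the ratio is not a precise predictor of latency or cost.

```python
# Total and per-token active parameter counts (in billions), as quoted
# in the model descriptions above.
QWEN35_MOE = {
    "Qwen3.5-397B-A17B": (397, 17),
    "Qwen3.5-122B-A10B": (122, 10),
    "Qwen3.5-35B-A3B": (35, 3),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used for each token (hypothetical helper)."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


for name, (total_b, active_b) in QWEN35_MOE.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of weights active per token")
```

All weights still have to be stored and served, so sparse activation mainly reduces per-token compute rather than memory footprint.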

Qwen3.5-27B is Alibaba's largest dense Qwen3.5 model, delivering near-frontier quality across reasoning, coding, and instruction following. It features a 262K token context window (extensible to 1M), thinking/reasoning mode, tool calling, multi-token prediction, and support for 201 languages. Best suited for production deployments and complex enterprise tasks requiring top-tier performance.

Qwen3.5-27B

Qwen3.5-9B is a high-performance model from Alibaba's Qwen3.5 series with a hybrid Gated Delta Networks and sparse MoE architecture. It features a 262K token context window, thinking/reasoning mode, tool calling, multi-token prediction, and support for 201 languages. Excels at reasoning, coding, instruction following, and long-context tasks.

Qwen3.5-9B

Qwen3.5-4B is a mid-size model from Alibaba's Qwen3.5 series that delivers a strong balance of performance and efficiency. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling, and support for 201 languages. Well-suited for complex reasoning, code generation, and agentic applications.

Qwen3.5-4B

Qwen3.5-2B is a compact yet capable model from Alibaba's Qwen3.5 series. It features a 262K token context window, support for 201 languages, thinking/reasoning mode, and tool calling for agentic workflows. A strong choice for prototyping, fine-tuning, and efficient multilingual deployments.

Qwen3.5-2B

Qwen3.5-0.8B is Alibaba's smallest model in the Qwen3.5 series, featuring a hybrid Gated Delta Networks and sparse Mixture-of-Experts architecture. Despite its compact size, it supports a 262K token context window, 201 languages, thinking/reasoning mode, and tool calling. Ideal for edge deployments, resource-constrained environments, and lightweight inference tasks.

Qwen3.5-0.8B

Efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-26B-A4B-it

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-31B-it

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

NVIDIA-Nemotron-3-Super-120B-A12B

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

GLM-5

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:
- 9 preset voices (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

Qwen3-TTS
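Raw PCM, as returned by a streaming response, has no header, so most players cannot open it directly. The sketch below shows one way to wrap streamed PCM chunks in a WAV container using only the Python standard library. The function name and the audio parameters (24 kHz, mono, 16-bit) are assumptions for illustration, not documented values for Qwen3-TTS; check the actual stream format before relying on them.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 24000,
                      channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM chunks in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions, not documented values.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        # Append each chunk as it arrives; the header is finalized on close.
        for chunk in chunks:
            wav_file.writeframes(chunk)
    return buffer.getvalue()


# Stand-in for streamed audio: two chunks of 16-bit silence.
fake_stream = [b"\x00\x00" * 1200, b"\x00\x00" * 1200]
wav_bytes = pcm_chunks_to_wav(fake_stream)
```

In a real client, `fake_stream` would be replaced by an iterator over the response body.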

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:
- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

Qwen3-TTS-VoiceDesign

MiniMax M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

MiniMax-M2.5

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Qwen3-Max

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

Qwen3-Max-Thinking

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

Nemotron-3-Nano-30B-A3B

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

Chatterbox is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI. We are excited to introduce Chatterbox-Turbo, our most efficient model yet. Built on a streamlined 350M parameter architecture, Turbo delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just one, while retaining high-fidelity audio output. Paralinguistic tags are now native to the Turbo model, allowing you to use [cough], [laugh], [chuckle], and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it also excels at narration and creative workflows. If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service.

chatterbox-turbo

The fastest model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

FLUX-2-klein-4b

The Flux 2 family model with the best quality-to-latency ratio, aimed at production apps. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-9b

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding. The latest Claude 4 generation delivers breakthrough capabilities in analytical thinking, with Claude 4 Opus setting new standards for intelligence and Claude 4 Sonnet providing exceptional performance with remarkable efficiency.

Claude models excel at understanding context, following complex instructions, and maintaining coherent conversations across extended interactions. With advanced features like extended thinking for deeper reasoning, prompt caching that reduces costs by up to 90%, vision capabilities for image analysis, and robust safety measures, Claude is designed for enterprise applications that demand both sophistication and reliability.

Available with comprehensive API features including streaming responses, batch processing for 50% cost savings, multilingual support across dozens of languages, and flexible context windows up to 200K tokens (1M in beta), Claude is perfect for building intelligent applications like customer support agents, content analysis systems, coding assistants, and complex reasoning workflows that require both accuracy and trustworthiness.

Claude AI family: Claude 4 Opus for complex reasoning, Claude 4 Sonnet for balanced performance, plus advanced capabilities like extended thinking, prompt caching, vision analysis, and enterprise-grade safety APIs.

Claude AI APIs via DeepInfra

claude

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding.

DeepInfra provides access to Anthropic's latest Claude models, featuring the most advanced reasoning capabilities and balanced performance options, all with enterprise-grade safety and reliability.

Claude

DeepSeek develops advanced foundation models optimized for computational efficiency and strong generalization across diverse tasks. The architecture incorporates recent advances in transformer-based systems, delivering robust performance in both zero-shot and fine-tuned scenarios. Models are pretrained on rigorously filtered multilingual corpora with specialized optimizations for mathematical reasoning and algorithmic tasks. The inference stack achieves competitive throughput while maintaining low latency, making it suitable for production deployment. Researchers and engineers can leverage these models for tasks ranging from natural language processing to complex analytical problem-solving.

deepseek

DeepSeek's models are a suite of advanced AI systems that prioritize efficiency, scalability, and real-world applicability.

DeepSeek

Developed by Black Forest Labs (the original creators behind Stable Diffusion), Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism. Built on an advanced 12-billion-parameter architecture, Flux models excel at understanding exactly what you want to create or modify.
FLUX 2 represents the next generation of the Flux family, introducing significantly improved image quality, faster generation, and more precise prompt following. The lineup includes Max for ultimate quality, Pro for production workflows, and Klein variants (9B and 4B) optimized for speed and efficiency — making high-quality image generation accessible at every scale.
Flux offers specialized variants for every need, from open-source to commercially licensed models. It is well suited to developers building creative applications, product visualization tools, and next-generation image editing experiences.

Flux AI image generation family: FLUX.1 Kontext for in-context editing, FLUX.1 Pro/Dev for text-to-image synthesis, plus comprehensive editing tools and state-of-the-art visual generation APIs.

Flux Image Generation APIs via DeepInfra

flux

Developed by Black Forest Labs, Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism.

DeepInfra provides access to Black Forest Labs' complete Flux ecosystem, offering everything from lightning-fast generation to sophisticated in-context editing capabilities with industry-leading prompt adherence and visual quality.

Flux

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities, designed for advanced reasoning, complex problem-solving, and comprehensive understanding across text, audio, video, and images. Built with revolutionary thinking architecture, Gemini models reason through problems step-by-step before responding, delivering enhanced accuracy and performance for sophisticated applications.

Gemini 2.5 Pro sets new standards for complex reasoning and coding excellence, while Gemini 2.5 Flash provides optimal price-performance for high-volume tasks. With massive context windows up to 1 million tokens, native multimodal processing that handles hours of video and audio, and transparent reasoning capabilities that show step-by-step thinking processes, Gemini excels at document analysis, code generation, scientific research, and agentic workflows.

Perfect for building intelligent applications that require deep reasoning, multimodal understanding, long-context processing, and transparent AI decision-making with Google's enterprise-grade reliability and performance.


Gemini AI family: Advanced thinking models with native multimodal processing for text, audio, video, and image understanding APIs

Gemini AI Model APIs via DeepInfra

gemini

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities.

DeepInfra provides access to Google's latest Gemini models, featuring advanced thinking capabilities, native multimodal processing, and industry-leading performance for complex reasoning and development tasks.

Gemini

Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest versions feature Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by dynamically activating subsets of parameters. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens), Llama excels in tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference, ensuring low-latency, high-throughput performance for production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama is ideal for developers seeking powerful, customizable AI with minimal overhead.

llama

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences.

Llama 4

Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

Llama 3

Llama

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance. These models excel at complex reasoning, mathematics, coding, and specialized tasks while offering complete transparency and deployment freedom through open-source licensing.

Mistral Small 3.2 delivers breakthrough efficiency with native fluency in European languages, while specialized variants handle specific needs: Devstral for coding, Voxtral for audio processing, and Mixtral for high-performance tasks. With Apache 2.0 licensing, extensive context windows up to 128K tokens, and comprehensive customization options, Mistral provides enterprise-grade capabilities without vendor lock-in.

Perfect for building multilingual applications, coding assistants, and reasoning systems where you need both powerful performance and complete control over your AI deployment.

Mistral AI model family: Mistral Small 3.2 for efficient performance, Mixtral for specialized tasks, Devstral for coding, plus multilingual reasoning, mathematics, and open-source flexibility APIs.

Mistral AI Model APIs via DeepInfra

mistral

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance.

DeepInfra provides access to Mistral AI's comprehensive open-source model ecosystem, from efficient small models to specialized coding and audio processing variants, all with complete Apache 2.0 licensing freedom.

Mistral

Voxtral is a family of audio models with state-of-the-art speech to text capabilities.

Voxtral

[NVIDIA Nemotron™](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) is a family of open models, datasets, and training recipes engineered for high performance, efficiency and customization. Nemotron models support synthetic data workflows and supervised fine-tuning, and are equally optimized for real-time inference, reasoning agents, and production AI systems.

nemotron

NVIDIA Nemotron is a family of open models customized for efficiency, accuracy, and specialized workloads.

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.
- **Nano** for maximum efficiency and stable inference
- **Super** for multi-agent systems and advanced reasoning
- **Instruct variants** for instruction-following and conversational workloads

Nemotron

Developed by Alibaba Group's Qwen Team, Qwen is a family of state-of-the-art large language and multimodal models designed for comprehensive AI capabilities and multilingual performance. The latest Qwen3 generation features balanced model architectures including reintroduced Mixture-of-Experts (MoE) variants (Qwen3-30B-A3B and Qwen3-235B-A22B) alongside dense models up to 32B parameters, enabling efficient resource utilization through dynamic parameter activation. 

With support for 119 languages and dialects, hybrid thinking modes that seamlessly alternate between reasoning and instruction-following without model switching, and extended context windows (up to 1M tokens in Qwen3-2507), Qwen excels in tasks like multilingual understanding, code generation, agentic workflows, and complex problem-solving. The models utilize advanced Byte-level Byte Pair Encoding with a 151,646-token vocabulary, structured ChatML formatting for conversational interactions, and robust tool calling capabilities with parallel execution support. 

Available in both proprietary and open-weight versions with flexible licensing, comprehensive model variants (Base, Instruct, Thinking, and hybrid modes), and enhanced Model Context Protocol support, Qwen is ideal for developers seeking powerful, multilingual AI systems with sophisticated reasoning capabilities and minimal deployment complexity.

Qwen AI model family: Qwen3 language models, specialized coding & reasoning models, plus state-of-the-art embedding & reranking APIs for search and RAG applications.

Qwen Model APIs via DeepInfra

qwen

Qwen series offers a comprehensive suite of dense and mixture-of-experts models.

DeepInfra provides access to Qwen's latest generation of large language models, offering both specialized coding models and general-purpose AI systems with advanced reasoning capabilities.

Qwen

Hermes 3 is a cutting-edge language model that offers advanced capabilities in roleplaying, reasoning, and conversation. It's a fine-tuned version of the Llama-3.1 405B foundation model, designed to align with user needs and provide powerful control. Key features include reliable function calling, structured output, generalist assistant capabilities, and improved code generation. Hermes 3 is competitive with Llama-3.1 Instruct models, with its own strengths and weaknesses.

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.  Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

The Multilingual-E5 models, initialized from XLM-RoBERTa, support up to 512 tokens per input — any longer text will be silently truncated. To ensure optimal performance, always prefix inputs with “query:” or “passage:”, as the model was explicitly trained with this format.
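The prefixing rule above is easy to get wrong when queries and documents pass through the same code path. As a minimal sketch, the helper below adds the expected prefix before text is sent for embedding. The function name is hypothetical, and the trailing space after the colon follows the convention commonly shown for E5 inputs (an assumption worth checking against the model card). It does not enforce the 512-token limit, which would require the model's tokenizer.

```python
from typing import List

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def prepare_e5_inputs(texts: List[str], kind: str) -> List[str]:
    """Prefix texts with "query: " or "passage: " (hypothetical helper)."""
    prefixes = {"query": QUERY_PREFIX, "passage": PASSAGE_PREFIX}
    if kind not in prefixes:
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = prefixes[kind]
    # Avoid double-prefixing text that already carries the marker.
    return [t if t.startswith(prefix) else prefix + t.strip() for t in texts]


queries = prepare_e5_inputs(["how do MoE models route tokens?"], "query")
passages = prepare_e5_inputs(["Mixtral selects two experts per token."], "passage")
```

Using the same helper at indexing time ("passage") and at search time ("query") keeps both sides consistent with the format the model was trained on.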

ClarityAI/creative is an AI-powered image upscaler that enhances details, adds realism, and creatively modifies images to improve their quality and visual appeal.

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps. 

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model and the successor to Gemma 2.

At 8 billion parameters, with superior quality and prompt adherence, this base model is the most powerful in the Stable Diffusion family. This model is ideal for professional use cases at 1 megapixel resolution.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
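The multipliers quoted above imply the size of the earlier extension phases. The following back-of-the-envelope check works them out. The baseline figures it prints are derived here from the quoted numbers and are not stated in the description itself.

```python
# Figures from the description above: new token budget (in billions) and
# the factor by which each phase was extended.
phases = {
    "32K extension": {"new_tokens_b": 630, "factor": 10.0},
    "128K extension": {"new_tokens_b": 209, "factor": 3.3},
}


def implied_baseline(new_tokens_b: float, factor: float) -> float:
    """Back out the earlier token budget implied by new = factor * old."""
    return new_tokens_b / factor


for name, p in phases.items():
    base = implied_baseline(p["new_tokens_b"], p["factor"])
    print(f"{name}: ~{base:.0f}B tokens before, {p['new_tokens_b']}B after")

# Combined budget across both long-context extension phases.
total_b = sum(p["new_tokens_b"] for p in phases.values())
print(f"Combined long-context extension budget: {total_b}B tokens")
```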

FLUX.1 Kontext [dev] is a 12-billion-parameter image editing model that transforms visuals based on natural language instructions. It allows highly consistent, multi-step edits and is released with open weights under a non-commercial license to empower artists and researchers.

This model is part of the GLM-V family of models, introduced in the paper GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.

Bria Erase Foreground precisely removes main subjects or foreground objects from images. Built entirely on licensed data, it is safe and optimized for professional and commercial use.

Built for the Agent era, it delivers stable performance in complex reasoning and long-horizon tasks, including multi-step planning, visual-text reasoning, video understanding, and advanced analysis.

Bria Background Generation allows for efficient swapping of backgrounds in images via text prompts or reference image, delivering realistic and polished results. Trained exclusively on licensed data for safe and risk-free commercial use.

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B).

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

The Multilingual-E5-large model is a 24-layer text embedding model with an embedding size of 1024, trained on a mixture of multilingual datasets and supporting 100 languages.

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma-2-27B delivers the best performance for its size class, and even offers competitive alternatives to models more than twice its size. 

A Mythomax/MLewd_13B-style merge of selected 70B models. A multi-model merge of several LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

Automatically identify and segment foreground objects across video frames and generate a mask. No prompts, just a video.

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick, a 17 billion parameter model with 128 experts

BGE embedding is a general-purpose embedding model. It is pre-trained using RetroMAE and then trained on large-scale pair data with contrastive learning. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.

Devstral is an agentic LLM for software engineering tasks. It excels at using tools to explore codebases, editing multiple files, and powering software engineering agents.

Mixtral is a Mixture-of-Experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model that combines 8 experts, each based on a 7B model. During inference, 2 experts are selected for each token. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.
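The routing described above (8 experts, 2 selected during inference) can be illustrated with a toy example. This is a scalar sketch of top-2 gating in plain Python, not Mistral AI's implementation. The experts, router scores, and function name are made up for illustration; a real model uses learned feed-forward experts and a learned router inside each layer.

```python
import math
from typing import Callable, Sequence


def top2_moe(x: float, router_logits: Sequence[float],
             experts: Sequence[Callable[[float], float]]) -> float:
    """Route one input through the two highest-scoring experts."""
    # Pick the indices of the two largest router scores.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    # Softmax over the selected scores only, so the two weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the two chosen experts are evaluated; the other six are skipped.
    return sum(w * experts[i](x) for w, i in zip(weights, top2))


# Eight toy "experts": expert k multiplies its input by k + 1.
experts = [lambda v, k=k: v * (k + 1) for k in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.5, -0.3, 0.2]
output = top2_moe(1.0, logits, experts)
```

Because only two of the eight experts run per input, compute per token stays close to that of a much smaller dense model, which is the source of the speed and cost advantage mentioned above.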

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

For a limited time, Fibo Edit is free on DeepInfra. FIBO Edit is an open-source image editing model with native masking and a lightweight 8B architecture, built for production-grade visual generation.

Llama 2 is a collection of LLMs trained by Meta. This is the 70B chat-optimized version. This endpoint has per-token pricing.

The most widely used version of Stable Diffusion. Trained on 512x512 images, it can generate realistic images from a text description.

The DeepSeek R1 0528 turbo model is a state-of-the-art reasoning model that can generate very quick responses.

Gemma is a family of lightweight, state-of-the-art open models from Google. The 9B Gemma 2 model delivers class-leading performance, outperforming Llama 3 8B and other open models in its size category.

Veo 3 Fast is a speed-optimized version of the Veo 3 model, designed for rapid video creation. While maintaining high quality, it delivers results in a fraction of the time, making it ideal for quick iterations and dynamic content generation.

We present a sentence transformation model that generates semantically similar sentences. Our model is based on the Sentence-Transformers architecture and was trained on a large dataset of sentence pairs. We evaluate the effectiveness of our model by measuring its ability to generate similar sentences that are close to the original sentence in meaning.

The Dolphin 2.6 Mixtral 8x7b model is a fine-tuned version of the Mixtral-8x7b model, trained on a variety of data including coding data, for 3 days on 4 A100 GPUs. It is uncensored and requires trust_remote_code. The model is very obedient and good at coding, but not DPO tuned. The dataset has been filtered to remove alignment and bias. The model is compliant with user requests and can be used for various purposes such as generating code or engaging in general chat.

Nemotron-4-340B-Instruct is an English-language chat model designed for synthetic data generation.

Veo 3.1 is the latest text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

NVIDIA Nemotron 2 Nano VL extends the Nemotron family into multi-modal reasoning and document intelligence. This auto-regressive vision-language model enables multi-image reasoning, video understanding, visual Q&A and document analysis and summarization. Optimized for enterprise AI workflows, it powers multimodal agentic systems such as visual copilots, document assistants, and knowledge automation pipelines.

Bria GenFill enables high-quality object addition or visual transformation. Trained exclusively on licensed data for safe and risk-free commercial use.

The Deliberate Model allows for the creation of anything desired, with the potential for better results as the user's knowledge and detail in the prompt increase. The model is ideal for meticulous anatomy artists, creative prompt writers, art designers, and those seeking explicit content.

Qwen3-Coder-30B-A3B-Instruct is a high-performance code generation model optimized for agentic coding and complex programming tasks. With 30.5B total parameters and 3.3B activated through Mixture-of-Experts architecture, it delivers exceptional efficiency. The model features native support for 256K token context (extendable to 1M), making it ideal for repository-scale code understanding. It excels at tool calling, browser automation, and multi-step coding workflows.

This model is a multilingual version of the OpenAI CLIP-ViT-B32 model, which maps text and images to a common dense vector space. It includes a text embedding model that works for 50+ languages and an image encoder from CLIP. The model was trained using Multilingual Knowledge Distillation, where a multilingual DistilBERT model was trained as a student model to align the vector space of the original CLIP image encoder across many languages.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Qwen3-Coder-480B-A35B-Instruct is Qwen3's most agentic code model, with strong performance on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

The llama-nemotron-rerank-vl-1b-v2 is a 1.7B parameter multimodal reranking model designed to evaluate and order the relevance of document images and text against specific user queries. It excels at understanding complex visual content like charts, tables, and infographics.

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

The new top-tier image model from Black Forest Labs, significantly pushing image quality and editing consistency.

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. 

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.  This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

OpenChat 3.6 is a Llama-3-8B fine-tune that outperforms its base model on multiple benchmarks.

Removes unwanted objects or regions from a video using a mask and reconstructs the background with intelligent content-aware fill.

CodeGemma is a collection of lightweight open code models built on top of Gemma. CodeGemma models are text-to-text and text-to-code decoder-only models and are available as a 7 billion pretrained variant that specializes in code completion and code generation tasks, a 7 billion parameter instruction-tuned variant for code chat and instruction following and a 2 billion parameter pretrained variant for fast code completion.

This is a 32B reasoning model trained from Qwen2.5-32B-Instruct on 17K data samples. Its performance is on par with the o1-preview model on both math and coding.

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B).

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis.  Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

The CLIP model maps text and images to a shared vector space, enabling various applications such as image search, zero-shot image classification, and image clustering. The model can be used easily after installation, and its performance is demonstrated through zero-shot ImageNet validation set accuracy scores. Multilingual versions of the model are also available for 50+ languages.
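Zero-shot classification in a shared vector space amounts to comparing an image embedding against the embeddings of candidate text labels and picking the closest one. The sketch below uses hand-written 3-dimensional toy vectors in place of real CLIP embeddings; the vectors, labels, and the `zero_shot_classify` helper are hypothetical and only illustrate the comparison step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {label: cosine(image_vec, vec) for label, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins for text embeddings of candidate labels (assumed values).
label_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy stand-in for the embedding of the image being classified.
image_vec = [0.8, 0.2, 0.1]

best, scores = zero_shot_classify(image_vec, label_vecs)
print(best)
```

Because no label-specific training is involved, the set of candidate labels can be changed freely at query time.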

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k.  This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. Kimi K2 is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. It excels across coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) benchmarks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Increase video resolution up to 8K with advanced AI upscaling. Bring your videos to the big screen, ready for the screens of tomorrow.

A zero-shot-image-classification model released by OpenAI.
The clip-vit-large-patch14-336 model was trained from scratch on an unknown dataset and achieves unspecified results on the evaluation set. The model's intended uses and limitations, as well as its training and evaluation data, are not provided. The training procedure used an unknown optimizer and precision, and the framework versions included Transformers 4.21.3, TensorFlow 2.8.2, and Tokenizers 0.12.1.

The 72 billion parameter Qwen2 excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.

This model offers the imaginative writing style of Chronos while still retaining coherency and being capable. Outputs are long and utilize exceptional prose. It supports a maximum context length of 4096. The model follows the Alpaca prompt format.

Kimi K2 is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks.

ClarityAI/flux integrates the Flux AI model into the upscaling process, enabling high-resolution enhancements with superior face preservation and support for LoRAs to apply specific styles or identities.

DeepSeek-V3.2-Exp is an intermediate step toward the next-generation architecture of the DeepSeek models by introducing DeepSeek Sparse Attention—a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

The CLIP model was developed by OpenAI to investigate the robustness of computer vision models. It uses a Vision Transformer architecture and was trained on a large dataset of image-caption pairs. The model shows promise in various computer vision tasks but also has limitations, including difficulties with fine-grained classification and potential biases in certain applications.

Llama-3.3-Nemotron-Super-49B-v1.5 is a large language model (LLM) optimized for advanced reasoning, conversational interactions, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta's Llama-3.3-70B-Instruct, it employs a Neural Architecture Search (NAS) approach, significantly enhancing efficiency and reducing memory requirements. 

Mistral Small 3 is a 24B-parameter language model optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment.  The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.

The latest image model, delivering better editing consistency, improved multi-image fusion, finer detail control, natural small text and faces, and harmonious, aesthetic visuals.

The Phi-3-Medium-4K-Instruct is a powerful and lightweight language model with 14 billion parameters, trained on high-quality data to excel in instruction following and safety measures. It demonstrates exceptional performance across benchmarks, including common sense, language understanding, and logical reasoning, outperforming models of similar size.

Whisper is a set of multilingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word timestamps. This version includes an implementation that predicts word timestamps and provides a more accurate estimation of speech segments when transcribing with Whisper models.

You can use cURL or any other http client to run inferences:

```bash
curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F audio=@my_voice.mp3  \
    'https://api.deepinfra.com/v1/inference/openai/whisper-timestamped-medium'
```

which will give you back something similar to:

```json
{
  "text": "",
  "segments": [
    {
      "end": 1.0,
      "id": 0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "id": 1,
      "start": 4.0,
      "text": "World"
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "duration": 0.0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```
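The `words` array in a response like the one above can be turned into a readable timeline with a few lines of Python. This is a minimal sketch that assumes the response has already been received as a JSON string; `sample_response` reuses the placeholder values from the example output, and `format_words` is a hypothetical helper name.

```python
import json

# Trimmed copy of the example response shown above (placeholder values).
sample_response = """
{
  "text": "",
  "language": "en",
  "words": [
    {"end": 1.0, "start": 0.0, "text": "Hello"},
    {"end": 5.0, "start": 4.0, "text": "World"}
  ]
}
"""

def format_words(response):
    """Return one '[start - end] word' line per entry in the 'words' array."""
    return [
        f"[{w['start']:.2f}s - {w['end']:.2f}s] {w['text']}"
        for w in response.get("words", [])
    ]

lines = format_words(json.loads(sample_response))
for line in lines:
    print(line)
```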


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'openai/whisper-timestamped-medium'  \
    -i audio=@my_voice.mp3
```

which will give you back something similar to:

```json
{
  "text": "",
  "segments": [
    {
      "end": 1.0,
      "id": 0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "id": 1,
      "start": 4.0,
      "text": "World"
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "duration": 0.0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```


We recommend using our NodeJS client https://github.com/deepinfra/deepinfra-node.

You can install it with

```bash
npm install deepinfra
```

and then

```javascript
import { AutomaticSpeechRecognition } from "deepinfra";
import path from "path";
import { fileURLToPath } from 'url';


const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const DEEPINFRA_API_KEY = "$DEEPINFRA_TOKEN";
const MODEL = "openai/whisper-timestamped-medium";

const main = async () => {
  const client = new AutomaticSpeechRecognition(MODEL, DEEPINFRA_API_KEY);

  const input = {
    audio: path.join(__dirname, "audio.mp3"),
  };
  const response = await client.generate(input);
  console.log(response.text);
};

main();
```


You can POST to our OpenAI Transcriptions and Translations compatible endpoint.

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
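The constraints in the list above can be checked on the client before sending a request. The helper below is a sketch, not part of any official SDK: `build_transcription_fields` is a hypothetical name, and it only assembles the non-file form fields as `(name, value)` pairs that an HTTP client could send alongside the `file` upload.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
ALLOWED_GRANULARITIES = {"word", "segment"}

def build_transcription_fields(model, language=None, prompt=None,
                               response_format="json", temperature=None,
                               timestamp_granularities=None):
    """Validate the documented options and return them as form fields."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")

    granularities = list(timestamp_granularities or [])
    unknown = set(granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported granularities: {sorted(unknown)}")
    # Timestamps are only returned with the verbose JSON format.
    if granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities[] requires verbose_json")

    fields = [("model", model), ("response_format", response_format)]
    if language is not None:
        fields.append(("language", language))
    if prompt is not None:
        fields.append(("prompt", prompt))
    if temperature is not None:
        fields.append(("temperature", str(temperature)))
    # Repeated keys express the array-style parameter.
    fields.extend(("timestamp_granularities[]", g) for g in granularities)
    return fields

example_fields = build_transcription_fields(
    "openai/whisper-timestamped-medium",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
print(example_fields)
```

Failing early on an invalid combination, such as word timestamps with the default `json` format, avoids a round trip to the API.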

## Response body

The transcription object or a verbose transcription object.

### Basic request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="openai/whisper-timestamped-medium"
```
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
```

### Word timestamp request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="openai/whisper-timestamped-medium" \
 -F response_format="verbose_json" \
 -F "timestamp_granularities[]=word"
```

```json
{
 "task": "transcribe",
 "language": "english",
 "duration": 8.470000267028809,
 "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
 "words": [
 {
 "word": "The",
 "start": 0.0,
 "end": 0.23999999463558197
 },
 ...
 {
 "word": "volleyball",
 "start": 7.400000095367432,
 "end": 7.900000095367432
 }
 ]
}
```
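Word-level timestamps are commonly used to build captions. The sketch below groups entries from a `words` array like the one above into SRT-style cues; the helper names (`srt_time`, `words_to_srt`), the `max_words` grouping rule, and the second sample word are assumptions made for illustration.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def words_to_srt(words, max_words=3):
    """Group consecutive words into numbered cues of at most max_words words."""
    cues = []
    for number, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[start:start + max_words]
        text = " ".join(w["word"] for w in chunk)
        span = f"{srt_time(chunk[0]['start'])} --> {srt_time(chunk[-1]['end'])}"
        cues.append(f"{number}\n{span}\n{text}")
    return "\n\n".join(cues) + "\n"

# Hypothetical word entries in the same shape as the response above.
sample_words = [
    {"word": "The", "start": 0.0, "end": 0.24},
    {"word": "beach", "start": 0.24, "end": 0.5},
]
srt_text = words_to_srt(sample_words, max_words=2)
print(srt_text)
```

When per-word control is not needed, requesting `response_format="srt"` directly is likely simpler.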

### Segment timestamp request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="openai/whisper-timestamped-medium" \
 -F response_format="verbose_json" \
 -F "timestamp_granularities[]=segment"
```

```json
{
 "task": "transcribe",
 "language": "english",
 "duration": 8.470000267028809,
 "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
 "segments": [
 {
 "id": 0,
 "seek": 0,
 "start": 0.0,
 "end": 3.319999933242798,
 "text": " The beach was a popular spot on a hot summer day.",
 "tokens": [
 50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
 ],
 "temperature": 0.0,
 "avg_logprob": -0.2860786020755768,
 "compression_ratio": 1.2363636493682861,
 "no_speech_prob": 0.00985979475080967
 },
 ...
 ]
}
```
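Segment objects carry per-segment diagnostics such as `no_speech_prob`, which can be used to drop segments that are probably silence or noise. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative threshold rather than a documented recommendation; `keep_speech_segments` and the second sample segment are hypothetical.

```python
def keep_speech_segments(segments, max_no_speech_prob=0.5):
    """Keep segments whose no_speech_prob is at or below the cutoff."""
    return [
        s for s in segments
        if s.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]

sample_segments = [
    # Values taken from the example response above.
    {"id": 0, "text": " The beach was a popular spot on a hot summer day.",
     "no_speech_prob": 0.00985979475080967},
    # Hypothetical segment that is most likely background noise.
    {"id": 1, "text": " ...", "no_speech_prob": 0.92},
]

speech = keep_speech_segments(sample_segments)
print([s["id"] for s in speech])
```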

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/translations" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/german.m4a" \
 -F model="openai/whisper-timestamped-medium"
```

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

You can use OpenAI's Python SDK to interact with our OpenAI Transcriptions and Translations compatible endpoint.

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

## Response body

The transcription object or a verbose transcription object.

### Example

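```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Open the audio file in binary mode; it is closed once the request finishes.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-timestamped-medium",
        file=audio_file,
    )

print(transcript.text)
```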
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
```

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

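```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Open the audio file in binary mode; it is closed once the request finishes.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="openai/whisper-timestamped-medium",
        file=audio_file,
    )

print(translation.text)
```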

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

You can use OpenAI's JavaScript SDK to interact with our OpenAI Transcriptions and Translations compatible endpoint.

```bash
npm install openai
```

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

## Response body

The transcription object or a verbose transcription object.

### Example

```js
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

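async function main() {
  // Transcribe the audio in its original language.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("speech.mp3"),
    model: "openai/whisper-timestamped-medium",
  });

  console.log(transcription.text);
}
main();
```
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."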
}
```

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `openai/whisper-timestamped-medium` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

```js
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  // Translate the audio into English text.
  const translation = await openai.audio.translations.create({
    file: fs.createReadStream("speech.mp3"),
    model: "openai/whisper-timestamped-medium",
  });

  console.log(translation.text);
}
main();
```

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

Input fields for `whisper-timestamped-medium` (HTTP/cURL API):

- **audio**
- **task**
- **initial_prompt**: Optional text to provide as a prompt for the first window.
- **temperature**
- **language**: Language that the audio is in; uses the detected language if None. Use a two-letter language code (ISO 639-1), e.g. `en`, `de`, `ja`.