Browse deepinfra models:

All categories and models you can try out and use directly on deepinfra:

black-forest-labs/FLUX-1.1-pro
featured
$0.04 / img
  • text-to-image

Black Forest Labs' latest state-of-the-art proprietary model, offering top-of-the-line prompt following, visual quality, detail, and output diversity.

black-forest-labs/FLUX-1-schnell
featured
$0.0005 x (width / 1024) x (height / 1024) x iters
  • text-to-image

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps.
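The schnell price scales with resolution and step count. As a rough sketch, assuming the listed formula is applied exactly as written (no rounding, minimum charge, or per-request fee, none of which the listing mentions), the per-image cost can be estimated like this:

```python
def flux_schnell_cost(width: int, height: int, iters: int) -> float:
    """Estimated USD cost per image for FLUX-1-schnell, using the listed formula
    $0.0005 x (width / 1024) x (height / 1024) x iters.
    Assumes no rounding, minimum charge, or extra fees."""
    return 0.0005 * (width / 1024) * (height / 1024) * iters


# A 1024x1024 image at 4 steps, the upper end of schnell's 1-4 step range
print(f"${flux_schnell_cost(1024, 1024, 4):.4f}")  # $0.0020
```

Under that assumption, halving either image dimension or the number of steps halves the cost.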

black-forest-labs/FLUX-1-dev
featured
$0.009 x (width / 1024) x (height / 1024) x (iters / 25)
  • text-to-image

FLUX.1-dev is a state-of-the-art 12 billion parameter rectified flow transformer developed by Black Forest Labs. This model excels in text-to-image generation, providing highly accurate and detailed outputs. It is particularly well-regarded for its ability to follow complex prompts and generate anatomically accurate images, especially with challenging details like hands and faces.
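The dev formula normalizes the step count to 25, so a 1024x1024 image at 25 steps costs the base $0.009. A minimal estimator, again assuming the formula is applied exactly as listed with no rounding or extra fees:

```python
def flux_dev_cost(width: int, height: int, iters: int) -> float:
    """Estimated USD cost per image for FLUX-1-dev, using the listed formula
    $0.009 x (width / 1024) x (height / 1024) x (iters / 25).
    Assumes no rounding, minimum charge, or extra fees."""
    return 0.009 * (width / 1024) * (height / 1024) * (iters / 25)


print(f"${flux_dev_cost(1024, 1024, 25):.4f}")  # $0.0090
# Fewer pixels but twice the steps: 0.75 x 2 = 1.5x the base price
print(f"${flux_dev_cost(1024, 768, 50):.4f}")
```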

black-forest-labs/FLUX-pro
featured
$0.05 / img
  • text-to-image

Black Forest Labs' first flagship model, based on Flux latent rectified flow transformers.

stabilityai/sd3.5-medium
featured
bf16
$0.03 / img
  • text-to-image

At 2.5 billion parameters, with an improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It can generate images at resolutions between 0.25 and 2 megapixels.

hexgrad/Kokoro-82M
featured
$0.80 per M characters
  • text-to-speech

Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out). On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.
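Because Kokoro is billed per character, a cost estimate is a simple proportion. The sketch below assumes every character counted by Python's `len()` (including spaces and punctuation) is billable; how the service actually counts characters is not stated in the listing.

```python
PRICE_PER_MILLION_CHARS = 0.80  # USD, from the listing above


def tts_cost(text: str) -> float:
    """Estimated USD cost to synthesize `text`, assuming every character
    (spaces and punctuation included) is billed."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS


passage = "The quick brown fox jumps over the lazy dog. " * 100
print(f"{len(passage)} chars -> ${tts_cost(passage):.6f}")
```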

openai/whisper-large-v3-turbo
featured
$0.00020 / minute
  • automatic-speech-recognition

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on more than 5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation.

openai/whisper-large-v3
featured
$0.00045 / minute
  • automatic-speech-recognition

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

distil-whisper/distil-large-v3
featured
$0.00018 / minute
  • automatic-speech-recognition

Distil-Whisper was proposed in the paper "Robust Knowledge Distillation via Large-Scale Pseudo Labelling". This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of OpenAI's Whisper large-v3, the latest and most performant Whisper model at the time of its release. Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give superior long-form transcription accuracy with OpenAI's sequential long-form algorithm.
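The three Whisper-family models above are priced per minute of audio. A minimal sketch for comparing them, assuming billing is exactly proportional to audio length (no rounding up to whole minutes, which is an assumption, not something the listing states):

```python
# Listed per-minute prices in USD
PRICE_PER_MINUTE = {
    "openai/whisper-large-v3-turbo": 0.00020,
    "openai/whisper-large-v3": 0.00045,
    "distil-whisper/distil-large-v3": 0.00018,
}


def transcription_cost(model: str, audio_seconds: float) -> float:
    """Estimated USD cost to transcribe `audio_seconds` of audio with `model`,
    assuming strictly proportional billing."""
    return PRICE_PER_MINUTE[model] * audio_seconds / 60


for model in PRICE_PER_MINUTE:
    print(f"{model}: ${transcription_cost(model, 3600):.4f} per hour")
```

For one hour of audio this gives about $0.012 for large-v3-turbo, $0.027 for large-v3, and $0.0108 for distil-large-v3.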

microsoft/WizardLM-2-8x22B
featured
bfloat16
64k
$0.50 / Mtoken
  • text-generation

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models.

Austism/chronos-hermes-13b-v2
fp16
4k
Replaced
  • text-generation

This model offers the imaginative writing style of Chronos while retaining coherence and capability. Outputs are long and written in exceptional prose. It supports a maximum context length of 4096 tokens and follows the Alpaca prompt format.
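The card says the model follows the Alpaca prompt format. A minimal sketch of building such a prompt with the common instruction/response layout; the exact header text and blank-line spacing here are assumptions, so check the model's own card before relying on them:

```python
def alpaca_prompt(instruction: str) -> str:
    """Wrap an instruction in a basic Alpaca-style layout.
    The header wording and spacing are assumed, not taken from this listing."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


print(alpaca_prompt("Write a short story about a lighthouse keeper."))
```

The model is expected to continue generating after the `### Response:` header, so the prompt deliberately ends there.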

BAAI/bge-base-en-v1.5
512
$0.005 / Mtoken
  • embeddings

BGE embedding is a general embedding model. It is pre-trained using RetroMAE and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.

BAAI/bge-large-en-v1.5
512
$0.010 / Mtoken
  • embeddings

BGE embedding is a general embedding model. It is pre-trained using RetroMAE and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.

BAAI/bge-m3
fp32
8k
$0.010 / Mtoken
  • embeddings

BGE-M3 is a versatile text embedding model that supports multi-functionality, multi-linguality, and multi-granularity, allowing it to perform dense retrieval, multi-vector retrieval, and sparse retrieval in over 100 languages and with input sizes up to 8192 tokens. The model can be used in a retrieval pipeline with hybrid retrieval and re-ranking to achieve higher accuracy and stronger generalization. BGE-M3 has shown state-of-the-art performance on several benchmarks, including MKQA, MLDR, and NarrativeQA, and can be used as a drop-in replacement for other embedding models such as DPR and BGE-v1.5.
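Hybrid retrieval combines scores from more than one retrieval mode. The sketch below illustrates the idea with only two of BGE-M3's three modes: cosine similarity between dense vectors, and a lexical score over per-token weights, merged by a weighted sum. The vectors, token weights, and the weights `w_dense`/`w_sparse` are hypothetical toy values, not outputs of BGE-M3, and the multi-vector score is omitted.

```python
import math


def dense_score(q: list[float], d: list[float]) -> float:
    """Cosine similarity between a dense query vector and a document vector."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q: dict[str, float], d: dict[str, float]) -> float:
    """Lexical match: sum of weight products for tokens present in both."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse,
                 w_dense: float = 1.0, w_sparse: float = 0.3) -> float:
    """Weighted sum of dense and sparse scores (weights are hypothetical)."""
    return (w_dense * dense_score(q_dense, d_dense)
            + w_sparse * sparse_score(q_sparse, d_sparse))


# Hypothetical toy embeddings and token weights, for illustration only
query_dense = [0.1, 0.7, 0.2]
query_sparse = {"lighthouse": 0.5, "storm": 0.3}
docs = {
    "doc_a": ([0.1, 0.6, 0.3], {"lighthouse": 0.4, "keeper": 0.2}),
    "doc_b": ([0.9, 0.1, 0.0], {"harbor": 0.5}),
}

ranked = sorted(
    docs,
    key=lambda k: hybrid_score(query_dense, docs[k][0], query_sparse, docs[k][1]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```

In a real pipeline the top candidates from this first stage would then typically be passed to a re-ranker, as the description above suggests.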

CompVis/stable-diffusion-v1-4
Replaced
  • text-to-image

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.

Gryphe/MythoMax-L2-13b-turbo
fp8
4k
Replaced
  • text-generation

A faster version of Gryphe/MythoMax-L2-13b, running on multiple H100 GPUs in fp8 precision at up to 160 tokens per second (tps).