FastVideo/LTX-2.3-Distilled-Diffusers

$0.0350 / second

A fast, step-distilled build of Lightricks' LTX-2.3 diffusion-transformer video model (distilled by FastVideo). Generates high-fidelity text-to-video and image-to-video in just a few denoising steps.

Input

Prompt

Text prompt describing the video content.

Negative Prompt

Optional negative text prompt; leave blank to fall back to the model's default negative prompt. (Default: uncanny face, mask-like, plastic skin, doll-like, waxy, mannequin, cgi, 3d render, deformed face, distorted face, extra fingers, deformed hands, blurry, washed out, vintage, 1970s, sepia, grainy, low quality)

Seconds

Clip duration: always 5 seconds (fixed/required for this model).

Resolution

Output resolution: always 1080p (fixed/required for this model).

Orientation

Output orientation: always landscape (fixed/required for this model).

Image Url

First-frame image for image-to-video (i2v): an http(s) URL or a data: URI. Required only for i2v; omit for text-to-video. (Default: empty)

Settings

Seed

Specify a seed for reproducible output. (Default: empty)

Output

Model Information

LTX-2.3 Distilled

LTX-2.3 is a diffusion-transformer (DiT) audio-video foundation model from Lightricks that generates high-fidelity video with synchronized audio from text or a starting image. This endpoint serves the distilled variant, accelerated with FastVideo (Hao AI Lab, UCSD) to produce results in only a few denoising steps.

Capabilities

  • Text-to-video — generate a clip from a text prompt.
  • Image-to-video — animate a still image by passing image_url (an http(s) URL or a data: URI).
  • Synchronized audio — produces a matching audio track (speech, ambient sound, music) alongside the video.
  • High-fidelity output at up to 1080p.
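
Since image_url accepts a data: URI as well as an http(s) URL, a local still image can be inlined rather than hosted. The sketch below shows one way to build such a URI in Python. The helper name image_to_data_uri is hypothetical, and this page does not state which image formats or payload sizes the endpoint accepts, so treat both as assumptions to verify.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data: URI.

    Hypothetical helper: the accepted image formats and any size
    limit for image_url are not documented here.
    """
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")

    # Read the raw bytes and base64-encode them as ASCII text.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be supplied wherever image_url is expected. For large images, a hosted http(s) URL is likely the more practical choice, because base64 encoding grows the data by roughly a third.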

Usage

Provide a descriptive prompt. For image-to-video, also pass image_url. Use negative_prompt to steer away from unwanted artifacts and seed for reproducible results. Detailed, concrete prompts — subject, action, setting, lighting, camera motion, and any sound or dialogue — produce the strongest results; for image-to-video, describe the motion you want applied to the supplied image.
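
As a rough illustration of the inputs described above, the following Python sketch assembles a request body from prompt, image_url, negative_prompt, and seed. The field names come from this page; the function name build_payload, the validation rules, and the assumption that the fields travel as a flat JSON object are illustrative only. How the body is sent (endpoint, authentication) is not covered on this page and is deliberately left out.

```python
from typing import Any, Dict, Optional


def build_payload(
    prompt: str,
    image_url: Optional[str] = None,
    negative_prompt: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a request body for text-to-video or image-to-video.

    Hypothetical helper: the flat-JSON layout is an assumption,
    not a documented schema.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")

    payload: Dict[str, Any] = {"prompt": prompt}

    # image_url switches the request to image-to-video; omit it for text-to-video.
    if image_url is not None:
        if not image_url.startswith(("http://", "https://", "data:")):
            raise ValueError("image_url must be an http(s) URL or a data: URI")
        payload["image_url"] = image_url

    # Leaving negative_prompt out falls back to the model's default negative prompt.
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt

    # A fixed seed makes the output reproducible.
    if seed is not None:
        payload["seed"] = seed

    return payload


# Example: a text-to-video request with a fixed seed.
example = build_payload(
    "A red fox trots through fresh snow at dawn, slow dolly-in, soft wind noise",
    seed=42,
)
```

Duration, resolution, and orientation are omitted because they are fixed for this model.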

Limitations

  • Not intended for generating factual or accurate real-world information.
  • May reflect or amplify societal biases present in its training data.
  • Prompt adherence can vary with phrasing and style.
  • Audio quality is lower for non-speech sounds than for speech.
  • Can produce unexpected or inappropriate content from some prompts.

Model & credits
