FastVideo/LTX-2.3-Distilled-Diffusers

$0.0350 / second

A fast, step-distilled build of Lightricks' LTX-2.3 diffusion-transformer video model (distilled by FastVideo). Generates high-fidelity text-to-video and image-to-video in just a few denoising steps.


Input

Prompt

Text prompt describing the video content.

Negative Prompt

Optional negative text prompt; leave blank to fall back to the model's default negative prompt. (Default: uncanny face, mask-like, plastic skin, doll-like, waxy, mannequin, cgi, 3d render, deformed face, distorted face, extra fingers, deformed hands, blurry, washed out, vintage, 1970s, sepia, grainy, low quality)

Seconds

Clip duration: always 5 seconds (fixed/required for this model).

Resolution

Output resolution: always 1080p (fixed/required for this model).

Orientation

Output orientation: always landscape (fixed/required for this model).

Image Url

First-frame image for image-to-video (i2v): an http(s) URL or a data: URI. Required only for i2v; omit for text-to-video. (Default: empty)
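When the first frame is a local file rather than a hosted image, it has to be turned into a data: URI before it can be passed in this field. The following is a minimal sketch of that encoding step using only the Python standard library; the helper name `image_to_data_uri` is hypothetical and not part of any documented client.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data: URI (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    # Base64-encode the raw bytes and prepend the data: URI header.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Base64 grows the data by about a third, so for large images a hosted http(s) URL may be the more practical choice.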


Settings

Seed

Specify a seed for reproducible output. (Default: empty)


Model Information

LTX-2.3 Distilled

LTX-2.3 is a diffusion-transformer (DiT) audio-video foundation model from Lightricks that generates high-fidelity video with synchronized audio from text or a starting image. This endpoint serves the distilled variant, accelerated with FastVideo (Hao AI Lab, UCSD) to produce results in only a few denoising steps.

Capabilities

Text-to-video — generate a clip from a text prompt.
Image-to-video — animate a still image by passing image_url (an http(s) URL or a data: URI).
Synchronized audio — produces a matching audio track (speech, ambient sound, music) alongside the video.
High-fidelity output at up to 1080p; this endpoint always renders at 1080p.
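The two generation modes listed above differ only in whether image_url is supplied. A small sketch of how a caller might decide which mode a request falls into, and reject URL schemes the page does not mention, is shown here. The function name `generation_mode` is hypothetical, and the server's actual validation rules are not documented on this page.

```python
from typing import Optional
from urllib.parse import urlparse

# Schemes the page lists as accepted for image_url.
ALLOWED_SCHEMES = {"http", "https", "data"}


def generation_mode(image_url: Optional[str]) -> str:
    """Return the mode implied by image_url (hypothetical helper)."""
    # No image means plain text-to-video.
    if not image_url:
        return "text-to-video"
    scheme = urlparse(image_url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported image_url scheme: {scheme!r}")
    return "image-to-video"
```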

Usage

Provide a descriptive prompt. For image-to-video, also pass image_url. Use negative_prompt to steer away from unwanted artifacts and seed for reproducible results. Detailed, concrete prompts — subject, action, setting, lighting, camera motion, and any sound or dialogue — produce the strongest results; for image-to-video, describe the motion you want applied to the supplied image.
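As a rough illustration of the parameters described above, the sketch below assembles a JSON request body. The field names `negative_prompt`, `image_url`, and `seed` come from this page; `prompt` is assumed from the Prompt field label. The function `build_payload` is hypothetical, and the endpoint URL and authentication are deliberately left out because this page does not document them.

```python
import json
from typing import Any, Dict, Optional


def build_payload(
    prompt: str,
    negative_prompt: Optional[str] = None,
    image_url: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble a request body from the documented inputs (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    payload: Dict[str, Any] = {"prompt": prompt}
    # Omitting negative_prompt lets the model fall back to its default.
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    # Supplying image_url switches the request to image-to-video.
    if image_url:
        payload["image_url"] = image_url
    # Compare against None so that a seed of 0 is still sent.
    if seed is not None:
        payload["seed"] = seed
    return payload


body = json.dumps(
    build_payload(
        "A red kite drifting over sand dunes at sunset, slow pan, wind noise",
        seed=42,
    )
)
```

Leaving unset fields out of the body, rather than sending empty strings, keeps the model's defaults in effect.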

Limitations

Not intended for generating factual or accurate real-world information.
May reflect or amplify societal biases present in its training data.
Prompt adherence can vary with phrasing and style.
Audio quality is lower for non-speech sounds than for speech.
Can produce unexpected or inappropriate content from some prompts.

Model & credits

Base model: LTX-2.3 by Lightricks — model page · docs · LTX-2 repo
Distillation & inference: FastVideo (Hao AI Lab, UCSD) — GitHub
Paper: LTX-2: Efficient Joint Audio-Visual Foundation Model
License: LTX-2 Community License Agreement


Project link: https://github.com/hao-ai-lab/FastVideo
Paper link: https://arxiv.org/abs/2601.03233
License link: https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE