sesame/
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
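To make "RVQ audio codes" concrete, here is a toy sketch of residual vector quantization (RVQ) in plain Python. It is an illustration only and is not Mimi's actual codec: the two tiny codebooks, the 2-dimensional vectors, and the function names are all assumptions made up for this example. Each stage picks the nearest codeword to what the previous stages left unexplained, so one frame is represented by a short list of integer codes.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage only sees the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]

codes = rvq_encode([1.2, 1.0], CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
```

A real codec uses learned codebooks with many more entries and dimensions, but the encode and decode loops have the same shape.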
Text to convert to speech
Settings
Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
Select the desired voice for the speech output. You can select multiple voices to combine and mix them.
Temperature of the generation (Default: 0.9)
Please upload an audio file
Transcript of the given speaker audio. If not provided, the speaker audio is used as is. (Default: empty)
Maximum audio length in milliseconds (Default: 10000)
Whether to stream audio bytes in chunks
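The settings above map naturally onto a request body. The following is a minimal sketch of how such a payload might be assembled and validated before sending. The field names (`text`, `response_format`, `temperature`, `max_audio_length_ms`, `speaker_transcript`, `stream`) and the function name are assumptions for illustration, not the documented API schema. Only the listed formats and the defaults for temperature (0.9), transcript (empty), and maximum length (10000 ms) come from this page. The `wav` default for the format is an arbitrary choice, and voice selection and speaker audio upload are left out.

```python
import json

# Formats listed on this page.
SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_tts_payload(
    text,
    response_format="wav",      # assumption: the page states no default format
    temperature=0.9,            # default from this page
    max_audio_length_ms=10000,  # default from this page
    speaker_transcript="",      # default from this page (empty)
    stream=False,
):
    """Build a request body using hypothetical field names, checking basic constraints."""
    if not text:
        raise ValueError("text must be a non-empty string")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if max_audio_length_ms <= 0:
        raise ValueError("max_audio_length_ms must be positive")

    payload = {
        "text": text,
        "response_format": response_format,
        "temperature": temperature,
        "max_audio_length_ms": max_audio_length_ms,
        "stream": stream,
    }
    # Only send the transcript when one is given; otherwise the speaker audio is used as is.
    if speaker_transcript:
        payload["speaker_transcript"] = speaker_transcript
    return payload


body = json.dumps(build_tts_payload("Hello from CSM.", response_format="mp3"))
```

Check the provider's API reference for the real field names and endpoint before using this with an HTTP client.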
© 2025 Deep Infra. All rights reserved.