

sesame/csm-1b

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

Public
$7.00 per million characters
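As a rough illustration of the listed price, the sketch below estimates the cost of a single request from its input length. It assumes every character of the input text (including spaces and punctuation) is billed, which this page does not confirm; the function name `estimate_cost_usd` is hypothetical.

```python
# Listed price on this page: $7.00 per million input characters.
PRICE_PER_MILLION_CHARS_USD = 7.00


def estimate_cost_usd(text: str) -> float:
    """Estimate the cost of synthesizing `text`.

    Assumption: every character, including whitespace, counts toward billing.
    """
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    sample = "Hello from CSM."
    print(f"{len(sample)} characters -> ${estimate_cost_usd(sample):.6f}")
```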

Input

Text to convert to speech

Settings

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Select the desired voice for the speech output. You can select multiple voices to combine and mix them.

Temperature of the generation (Default: 0.9)

Speaker audio file to upload

Transcript of the given speaker audio. If not provided, the speaker audio is used as is. (Default: empty)

Maximum audio length in milliseconds (Default: 10000)

Whether to stream audio bytes in chunks
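The settings above can be collected into a request payload along the following lines. This is a minimal sketch, not the service's documented API: every field name (`text`, `response_format`, `voices`, `temperature`, `speaker_transcript`, `max_audio_length_ms`, `stream`) is a hypothetical placeholder, and the `wav` and `stream=False` defaults are assumptions. Only the supported formats, the 0.9 temperature default, the empty transcript default, and the 10000 ms length default come from this page. The speaker audio upload is left out because its encoding is not described here.

```python
import json
from typing import List, Optional

# Output formats listed on this page.
SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_request(
    text: str,
    response_format: str = "wav",        # assumed default; not stated on the page
    voices: Optional[List[str]] = None,  # voice names are not listed on the page
    temperature: float = 0.9,            # default from the page
    speaker_transcript: str = "",        # default from the page (empty)
    max_audio_length_ms: int = 10000,    # default from the page
    stream: bool = False,                # assumed default; not stated on the page
) -> dict:
    """Validate the settings and return a payload with hypothetical field names."""
    if not text:
        raise ValueError("text must not be empty")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if max_audio_length_ms <= 0:
        raise ValueError("max_audio_length_ms must be positive")

    payload = {
        "text": text,
        "response_format": response_format,
        "temperature": temperature,
        "speaker_transcript": speaker_transcript,
        "max_audio_length_ms": max_audio_length_ms,
        "stream": stream,
    }
    if voices:
        # Multiple voices may be selected to combine and mix them.
        payload["voices"] = list(voices)
    return payload


if __name__ == "__main__":
    print(json.dumps(build_request("Hello from CSM.", response_format="mp3"), indent=2))
```

Validating the format and length locally before sending keeps obviously malformed requests from being submitted; the actual field names and endpoint should be taken from the provider's API reference.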


Model Information