sesame/
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
You can use cURL or any other HTTP client to run inference:
curl -X POST \
  -d '{"text": "The quick brown fox jumps over the lazy dog"}' \
  -H "Authorization: bearer $DEEPINFRA_TOKEN" \
  -H 'Content-Type: application/json' \
  'https://api.deepinfra.com/v1/inference/sesame/csm-1b'
The response has the following shape; the values shown here are placeholders rather than real output:
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
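The same request can be made from other HTTP clients. The following is a minimal Python sketch using only the standard library. It assumes the endpoint, payload, and headers from the cURL example above. The helper names `build_request`, `synthesize`, and `word_timings` are illustrative and not part of any official SDK. The encoding of the `audio` field is not documented here, so the sketch only reads the `words` list.

```python
import json
import urllib.request

# Endpoint taken from the cURL example above.
API_URL = "https://api.deepinfra.com/v1/inference/sesame/csm-1b"


def build_request(text: str, token: str) -> urllib.request.Request:
    """Build the same POST request as the cURL example, without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )


def synthesize(text: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response as a dict."""
    with urllib.request.urlopen(build_request(text, token), timeout=timeout) as resp:
        return json.load(resp)


def word_timings(response: dict) -> list[tuple[str, float, float]]:
    """Extract (text, start, end) tuples from "words"; empty if missing or null."""
    return [(w["text"], w["start"], w["end"]) for w in response.get("words") or []]


# Example usage (performs a network call; requires DEEPINFRA_TOKEN to be set):
#
#   import os
#   result = synthesize("The quick brown fox jumps over the lazy dog",
#                       os.environ["DEEPINFRA_TOKEN"])
#   for text, start, end in word_timings(result):
#       print(start, end, text)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without network access or a valid token.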
© 2025 Deep Infra. All rights reserved.