inworld-ai/inworld-tts-1.5-max

$10.00 / 1M characters

High-quality multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for natural, expressive speech, with a P90 time-to-first-audio under 250 ms.

Partner

Public


Input

Input text

Text to convert to speech

Settings

Service tier

The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).

Voice

Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.

Response format

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Speaking rate

Speed of the generated speech (default: 1; range: 0.5 ≤ speaking_rate ≤ 1.5)

Temperature

Temperature controls variability of the speech (Default: empty, 0 ≤ temperature ≤ 2)

Sample rate

Sample rate for the output audio (Default: 24000)

Return timestamps

Whether to return word-level timestamps

Stream

Whether to stream the output


Model Information

Inworld TTS 1.5 Max is a high-quality text-to-speech model developed by Inworld AI. It delivers natural, expressive speech across 15 languages with 130+ preset voices and support for instant voice cloning.

Key Features

- 130+ preset voices: diverse styles including narrators, conversationalists, character voices, and more
- 15 languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Chinese, Japanese, Korean, Hindi, Arabic, Hebrew
- Voice cloning: create a custom voice from 5–15 seconds of reference audio
- Word-level timestamps: precise timing data for each word, useful for lip-sync, captions, and UI highlighting
- Streaming: low-latency audio streaming with <250ms P90 time-to-first-audio
- Multiple output formats: PCM, MP3, OPUS at 16kHz, 24kHz, or 48kHz sample rates
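
As an illustration of the captions use case for word-level timestamps, the sketch below groups timed words into caption cues and renders them as SRT text. The response shape is not documented on this page, so the `Word` structure (text plus start and end times in seconds) and the helper names `group_captions`, `srt_time`, and `to_srt` are assumptions made for the example, not part of the model's API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical shape for one timed word; the real response may differ."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def group_captions(words, max_chars=32):
    """Group consecutive words into cues of at most max_chars characters.

    Returns a list of (start, end, text) tuples.
    """
    cues = []
    current = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            # Close the current cue and start a new one with this word.
            cues.append((current[0].start, current[-1].end,
                         " ".join(w.text for w in current)))
            current = [word]
        else:
            current.append(word)
    if current:
        cues.append((current[0].start, current[-1].end,
                     " ".join(w.text for w in current)))
    return cues


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) cues as SRT subtitle text."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```

The same grouped cues could drive UI highlighting by comparing the audio player's current time against each word's `start` and `end`.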

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | string | required | Text to synthesize (up to 500,000 characters) |
| `voice` | string | `"Ashley"` | Voice name from 130+ presets, or a cloned voice ID |
| `output_format` | string | `"mp3"` | Output audio format: `mp3`, `wav`, `opus`, `pcm` |
| `speaking_rate` | float | `1.0` | Speed of speech (0.5–1.5) |
| `temperature` | float | `1.1` | Controls variability in synthesis (0–2). Higher values produce more expressive speech |
| `sample_rate` | int | `24000` | Audio sample rate: 16000, 24000, or 48000 Hz |
| `return_timestamps` | bool | `false` | Return word-level timestamps in the response |
| `speaker_audio` | binary | none | Reference audio for voice cloning (5–15 seconds) |
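
The ranges and defaults in the table can be checked client-side before a request is sent. The following is a minimal sketch that validates the documented parameters, builds a payload dictionary, and estimates cost from the listed price of $10.00 per 1M characters. The function names `build_tts_payload` and `estimate_cost_usd` are hypothetical, the sketch makes no network call, and how the payload is submitted (endpoint, authentication, handling of `speaker_audio`) is not covered on this page and is left out.

```python
# Limits and defaults copied from the parameter table; helper names are hypothetical.
ALLOWED_FORMATS = {"mp3", "wav", "opus", "pcm"}
ALLOWED_SAMPLE_RATES = {16000, 24000, 48000}
MAX_TEXT_CHARS = 500_000
PRICE_PER_MILLION_CHARS_USD = 10.00


def build_tts_payload(text, voice="Ashley", output_format="mp3",
                      speaking_rate=1.0, temperature=1.1,
                      sample_rate=24000, return_timestamps=False):
    """Validate parameters against the documented ranges and return a dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(ALLOWED_SAMPLE_RATES)}")
    return {
        "text": text,
        "voice": voice,
        "output_format": output_format,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "sample_rate": sample_rate,
        "return_timestamps": return_timestamps,
    }


def estimate_cost_usd(text):
    """Estimate cost in USD from the character count at the listed price."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS_USD


payload = build_tts_payload("Hello from the docs.", speaking_rate=1.2,
                            return_timestamps=True)
print(payload)
print(f"Estimated cost: ${estimate_cost_usd(payload['text']):.6f}")
```

Validating locally keeps out-of-range values, such as a `speaking_rate` of 2.0, from turning into a failed request.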