inworld-ai/inworld-tts-1.5-max

$10.00 / 1M characters

High-quality multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for natural, expressive speech with <250ms time-to-first-audio.

Partner
Public
Input

  • Input text: Text to convert to speech.

Settings

  • ServiceTier: The service tier used for processing the request. When set to 'priority', the request is processed with higher priority (only applies to models that support it).
  • Voice: Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.
  • TtsResponseFormat: The desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
  • Speaking rate: Speaking rate of the speech (Default: 1, 0.5 ≤ speaking_rate ≤ 1.5).
  • Temperature: Controls variability of the speech (Default: empty, 0 ≤ temperature ≤ 2).
  • Sample rate: Sample rate for the output audio (Default: 24000).
  • Return timestamps: Whether to return word-level timestamps.
  • Stream: Whether to stream the output.

Model Information

Inworld TTS 1.5 Max is a high-quality text-to-speech model developed by Inworld AI. It delivers natural, expressive speech across 15 languages with 130+ preset voices and support for instant voice cloning.

Key Features

  • 130+ preset voices — diverse styles including narrators, conversationalists, character voices, and more
  • 15 languages — English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Chinese, Japanese, Korean, Hindi, Arabic, Hebrew
  • Voice cloning — create a custom voice from 5–15 seconds of reference audio
  • Word-level timestamps — precise timing data for each word, useful for lip-sync, captions, and UI highlighting
  • Streaming — low-latency audio streaming with <250ms P90 time-to-first-audio
  • Multiple output formats — PCM, MP3, OPUS at 16kHz, 24kHz, or 48kHz sample rates
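As an illustration of the captions use case mentioned above, here is a minimal Python sketch that groups word timings into short caption cues. The page does not document the shape of the timestamp data, so the `WordTiming` structure (word, start and end in seconds) and the `group_into_cues` helper are hypothetical names chosen for this example, not part of the model's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordTiming:
    # Hypothetical shape: the real response format is not documented on this page.
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def group_into_cues(words: List[WordTiming], max_words: int = 4) -> List[Tuple[float, float, str]]:
    """Group consecutive words into caption cues of at most max_words words.

    Each cue is (start_time, end_time, text), where the start comes from the
    first word in the group and the end from the last word.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w.word for w in chunk)
        cues.append((chunk[0].start, chunk[-1].end, text))
    return cues
```

A caption renderer or UI highlighter could then show each cue between its start and end times.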

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | required | Text to synthesize (up to 500,000 characters) |
| voice | string | "Ashley" | Voice name from 130+ presets, or a cloned voice ID |
| output_format | string | "mp3" | Output audio format: mp3, wav, opus, pcm |
| speaking_rate | float | 1.0 | Speed of speech (0.5–1.5) |
| temperature | float | 1.1 | Controls variability in synthesis (0–2). Higher values produce more expressive speech |
| sample_rate | int | 24000 | Audio sample rate: 16000, 24000, or 48000 Hz |
| return_timestamps | bool | false | Return word-level timestamps in the response |
| speaker_audio | binary | none | Reference audio for voice cloning (5–15 seconds) |
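
The parameter ranges above can be checked on the client before a request is sent. The following Python sketch builds a request body from the documented parameter names and defaults and rejects out-of-range values. The helper name `build_tts_payload` is hypothetical, the endpoint URL and authentication are not given on this page and are deliberately left out, and `speaker_audio` is omitted because its upload encoding is not described.

```python
import json

# Allowed values taken from the Parameters table.
ALLOWED_FORMATS = {"mp3", "wav", "opus", "pcm"}
ALLOWED_SAMPLE_RATES = {16000, 24000, 48000}
MAX_TEXT_CHARS = 500_000


def build_tts_payload(
    text: str,
    voice: str = "Ashley",
    output_format: str = "mp3",
    speaking_rate: float = 1.0,
    temperature: float = 1.1,
    sample_rate: int = 24000,
    return_timestamps: bool = False,
) -> dict:
    """Validate parameters against the documented ranges and return a dict.

    Hypothetical helper: it only assembles and checks the fields; sending
    the request is out of scope because the endpoint is not documented here.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(ALLOWED_SAMPLE_RATES)}")
    return {
        "text": text,
        "voice": voice,
        "output_format": output_format,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "sample_rate": sample_rate,
        "return_timestamps": return_timestamps,
    }


if __name__ == "__main__":
    payload = build_tts_payload("Hello, world!", voice="Diego", return_timestamps=True)
    print(json.dumps(payload, indent=2))
```

Validating locally gives a clearer error message than a rejected request would, and keeps the documented limits in one place.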

Voices (sample)

Ashley, Blake, Dennis, Diego, Dominus, Elizabeth, Hades, Luna, Pixie, and 120+ more across all supported languages.

Pricing

$10 per 1 million input characters.
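
For budgeting, the per-character price works out to $0.00001. A minimal Python sketch of the arithmetic follows; it assumes every character counted by `len()` is billable, which this page does not confirm, and the function name `estimate_cost_usd` is hypothetical.

```python
PRICE_PER_MILLION_CHARS_USD = 10.00  # from the Pricing section


def estimate_cost_usd(text: str) -> float:
    """Estimate synthesis cost, assuming each character in text is billed."""
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    # 250,000 characters at $10 per million characters
    print(estimate_cost_usd("a" * 250_000))  # 2.5
```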