nari-labs/Dia-1.6B

Dia generates highly realistic dialogue directly from a transcript. You can condition the output on audio, which enables control over emotion and tone. The model can also produce nonverbal sounds such as laughter, coughing, and throat clearing.

Public
$20.00 per million characters

HTTP/cURL API

You can use cURL or any other HTTP client to run inference:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F 'input=[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices.'  \
    'https://api.deepinfra.com/v1/inference/nari-labs/Dia-1.6B'

The response has the following shape (the values shown are placeholders):

{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 1.0,
      "confidence": 0.5
    },
    {
      "text": "World",
      "start": 4.0,
      "end": 5.0,
      "confidence": 0.5
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
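
For readers who prefer Python over cURL, the following is a minimal sketch of the same call using only the standard library. It assumes the endpoint accepts the multipart/form-data body that cURL's -F flag produces, and that the reply is JSON shaped like the sample above. The helper names build_request, word_spans, and run are illustrative and not part of any official client.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.deepinfra.com/v1/inference/nari-labs/Dia-1.6B"


def build_request(token: str, fields: dict[str, str]) -> urllib.request.Request:
    """Encode fields as multipart/form-data, mirroring cURL's -F flag."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    headers = {
        "Authorization": f"bearer {token}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def word_spans(response: dict) -> list[tuple[str, float]]:
    """Return (text, duration in seconds) for each entry under "words"."""
    return [(w["text"], w["end"] - w["start"]) for w in response.get("words") or []]


def run(token: str, text: str) -> dict:
    """Send the request and decode the JSON reply (performs a network call)."""
    with urllib.request.urlopen(build_request(token, {"input": text})) as resp:
        return json.load(resp)
```

Calling run(token, "[S1] Hello. [S2] Hi there.") with a valid token would perform the request. How the "audio" field is encoded is not shown in the sample response, so decoding it is left out of this sketch.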

Input fields

input (string)

Text to convert to speech


speaker_audio (string)

Audio prompt for the speech to be synthesized


speaker_transcript (string)

Transcript of the given speaker audio. If not provided, the speaker audio is used as is.


max_new_tokens (integer)

Controls the maximum length of the generated audio (more tokens = longer audio).

Default value: 3072

Range: 500 ≤ max_new_tokens ≤ 4096


cfg_scale (integer)

Higher values increase adherence to the text prompt.

Default value: 3

Range: 1 ≤ cfg_scale ≤ 5


temperature (number)

Lower values make the output more deterministic, higher values increase randomness.

Default value: 1.3

Range: 1 ≤ temperature ≤ 1.5


top_p (number)

Restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches top_p.

Default value: 0.95

Range: 0.8 ≤ top_p ≤ 1


cfg_filter_top_k (integer)

Top-k filter applied during classifier-free guidance (CFG).

Default value: 35

Range: 15 ≤ cfg_filter_top_k ≤ 50


speed (number)

Adjusts the speed of the generated audio (1.0 = original speed).

Default value: 0.94

Range: 0.8 ≤ speed ≤ 1


webhook (file)

The webhook to call when inference is done. By default, the output is returned in the response to your inference request.
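
The documented ranges can be checked on the client before a request is sent. The sketch below copies the minimum, maximum, and default values from the field list above. The helper name build_fields and the choice to raise exceptions on invalid values are assumptions made for illustration, and how the server itself handles out-of-range values is not documented here.

```python
# (min, max, default) for each tunable field, copied from the list above.
RANGES = {
    "max_new_tokens": (500, 4096, 3072),
    "cfg_scale": (1, 5, 3),
    "temperature": (1.0, 1.5, 1.3),
    "top_p": (0.8, 1.0, 0.95),
    "cfg_filter_top_k": (15, 50, 35),
    "speed": (0.8, 1.0, 0.94),
}
# Fields documented with type "integer".
INTEGER_FIELDS = {"max_new_tokens", "cfg_scale", "cfg_filter_top_k"}


def build_fields(text: str, **overrides) -> dict[str, str]:
    """Return form fields for a request, rejecting out-of-range values."""
    fields = {"input": text}
    for name, value in overrides.items():
        if name not in RANGES:
            raise KeyError(f"unknown parameter: {name}")
        low, high, _default = RANGES[name]
        if name in INTEGER_FIELDS and not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
        fields[name] = str(value)
    return fields
```

For example, build_fields("[S1] Hi.", speed=0.9) returns a dictionary with the "input" and "speed" entries as strings, which can be passed as form fields to any HTTP client. Fields that are omitted are left for the server to fill with its defaults.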
