
Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal sounds such as laughter, coughing, and throat clearing.

Dia-1.6B


You can use cURL or any other HTTP client to run inference:

```bash
curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F 'input=[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices.'  \
    'https://api.deepinfra.com/v1/inference/nari-labs/Dia-1.6B'
```

which will give you back something similar to:

```json
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 1.0,
      "confidence": 0.5
    },
    {
      "text": "World",
      "start": 4.0,
      "end": 5.0,
      "confidence": 0.5
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

```
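The word entries in a reply like the one above can be read with a few lines of Python. This is a minimal sketch, assuming the reply body has already been received as a JSON string. `summarize_reply` is a hypothetical helper name, and the sample payload only mirrors the placeholder values shown above.

```python
import json

# Trimmed copy of the example reply shown above (placeholder values).
SAMPLE_REPLY = """
{
  "audio": null,
  "words": [
    {"text": "Hello", "start": 0.0, "end": 1.0, "confidence": 0.5},
    {"text": "World", "start": 4.0, "end": 5.0, "confidence": 0.5}
  ],
  "inference_status": {"status": "unknown", "runtime_ms": 0, "cost": 0.0}
}
"""


def summarize_reply(raw: str) -> dict:
    """Extract word timings and the billed cost from a reply body (hypothetical helper)."""
    reply = json.loads(raw)
    words = reply.get("words") or []
    status = reply.get("inference_status") or {}
    return {
        "transcript": " ".join(w["text"] for w in words),
        # (text, start, end) tuples, with times exactly as the API reports them
        "timings": [(w["text"], w["start"], w["end"]) for w in words],
        "cost_usd": status.get("cost"),
    }


summary = summarize_reply(SAMPLE_REPLY)
print(summary["transcript"])
```

Using `.get(...) or {}` keeps the helper tolerant of replies where `words` or `inference_status` is `null`.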


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'nari-labs/Dia-1.6B'  \
    -i 'input=[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices.'
```

The output has the same shape as the JSON response shown for the cURL request above.


The DeepInfra OpenAI-compatible Speech API endpoint enables users to effortlessly convert text into speech audio. This document outlines how to integrate and utilize this endpoint to quickly create speech from text inputs, leveraging various audio output formats.

## Create speech

Use the following Python example to generate an audio file from your text input:

```python
import os
from pathlib import Path

from openai import OpenAI

# Read the API token from the environment instead of hard-coding it.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_TOKEN"],
)

speech_file_path = Path(__file__).parent / "speech.mp3"

# Stream the generated audio straight to disk.
with client.audio.speech.with_streaming_response.create(
    model="nari-labs/Dia-1.6B",
    voice="luna",
    input="The quick brown fox jumped over the lazy dog.",
    response_format="mp3",
) as response:
    response.stream_to_file(speech_file_path)
```

The API returns the generated audio file in the requested format (e.g., `mp3`, `pcm`). The example above saves the audio output directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Best for lowest latency streaming*        |
| mp3    | Compressed audio suitable for storage | General use and file distribution          |
| opus   | Compressed audio ideal for streaming  | Efficient streaming and real-time playback |
| flac   | Lossless audio compression            | High-quality archival storage              |
| wav    | Uncompressed audio                    | High-quality audio processing tasks        |

### Performance Recommendations

- **For lowest latency (fastest initial audio chunk):**
  - Use the `pcm` output format.

- **For general purposes:**
  - Use the `mp3` or `opus` formats for optimal balance between quality and file size.
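The recommendations above can be encoded as a small lookup. This is only a sketch: `choose_format` is a hypothetical helper, and giving low latency priority over lossless output is an assumption made for the example, not something the API requires.

```python
# Usage notes copied from the "Supported Audio Formats" table above.
FORMAT_USAGE = {
    "pcm": "lowest latency streaming",
    "mp3": "general use and file distribution",
    "opus": "efficient streaming and real-time playback",
    "flac": "high-quality archival storage",
    "wav": "high-quality audio processing tasks",
}


def choose_format(low_latency: bool = False, lossless: bool = False) -> str:
    """Pick a response_format value following the recommendations above."""
    if low_latency:
        return "pcm"   # fastest initial audio chunk
    if lossless:
        return "flac"  # lossless compression for archival storage
    return "mp3"       # general-purpose default


print(choose_format(low_latency=True), "->", FORMAT_USAGE[choose_format(low_latency=True)])
```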



Use the following JavaScript example to generate an audio file from your text input:

```javascript
import fs from "fs";
import path from "path";
import OpenAI from "openai";

// The Node SDK takes a single options object with camelCase keys.
const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: process.env.DEEPINFRA_TOKEN,
});

const speechFile = path.resolve("./speech.mp3");

async function main() {
  const mp3 = await openai.audio.speech.create({
    model: "nari-labs/Dia-1.6B",
    voice: "luna",
    input: "The quick brown fox jumped over the lazy dog.",
    response_format: "mp3",
  });
  console.log(speechFile);
  // Convert the response body to a Buffer and write it to disk.
  const buffer = Buffer.from(await mp3.arrayBuffer());
  await fs.promises.writeFile(speechFile, buffer);
}
main();
```

Use the following example `curl` request to generate an audio file from your text input:

```bash
curl https://api.deepinfra.com/v1/openai/audio/speech \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nari-labs/Dia-1.6B",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "luna",
    "response_format": "mp3"
  }' \
  --output speech.mp3
```
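The same request can be assembled with Python's standard library alone, which makes the URL, headers, and JSON body explicit. This is a minimal sketch: `build_speech_request` is a hypothetical helper and `YOUR_TOKEN` is a placeholder. The request is only built here, not sent.

```python
import json
import urllib.request

SPEECH_URL = "https://api.deepinfra.com/v1/openai/audio/speech"


def build_speech_request(token: str, text: str, voice: str = "luna",
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build (but do not send) the POST request used in the curl example above."""
    payload = {
        "model": "nari-labs/Dia-1.6B",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_speech_request("YOUR_TOKEN", "The quick brown fox jumped over the lazy dog.")
# To send it, pass `req` to urllib.request.urlopen and write the returned
# bytes to a file such as speech.mp3.
```

Serializing the body with `json.dumps` avoids hand-written JSON mistakes such as trailing commas.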




The DeepInfra ElevenLabs-compatible Speech API endpoint allows users to seamlessly convert text inputs into high-quality speech audio. This document provides guidance on integrating and using this endpoint to generate realistic audio files efficiently.

Use the following Python examples to generate an audio file from your text input:

## Create Speech (non-streaming)

```python
import os

from elevenlabs import ElevenLabs

# Read the API token from the environment instead of hard-coding it.
client = ElevenLabs(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/",
)
client.text_to_speech.convert(
    voice_id="luna",
    output_format="mp3",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="nari-labs/Dia-1.6B",
)
```

## Create Speech with Streaming

```python
import os

from elevenlabs import ElevenLabs

client = ElevenLabs(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    # Point the client at DeepInfra, as in the non-streaming example.
    base_url="https://api.deepinfra.com/",
)
client.text_to_speech.convert_as_stream(
    voice_id="luna",
    output_format="pcm",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="nari-labs/Dia-1.6B",
)
```
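Neither call above writes anything to disk by itself. The sketch below shows one way to save the result, assuming the client call returns an iterable of `bytes` chunks (as recent releases of the `elevenlabs` package do). `save_audio` is a hypothetical helper, and the chunk list is a stand-in for a real response.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_audio(chunks: Iterable[bytes], path: Path) -> int:
    """Write audio chunks to `path` and return the number of bytes written."""
    total = 0
    with Path(path).open("wb") as f:
        for chunk in chunks:
            if chunk:  # ignore empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# Stand-in for the iterator returned by the client call (assumption).
fake_chunks = [b"ID3", b"", b"\x00\x01\x02"]
out_path = Path(tempfile.mkdtemp()) / "speech.mp3"
written = save_audio(fake_chunks, out_path)
print(f"wrote {written} bytes to {out_path}")
```

With a real client, pass the return value of `client.text_to_speech.convert(...)` as `chunks`.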

The API returns the generated audio in the requested format, such as `mp3` or `pcm`. The client calls above return the audio data rather than writing a file, so save the returned bytes yourself (for example as `speech.mp3`).

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Lowest latency streaming scenarios*       |
| mp3    | Compressed audio suitable for storage | General use and easy file sharing          |
| opus   | Compressed audio ideal for streaming  | Real-time audio streaming applications     |
| flac   | Lossless audio compression            | Archival storage and high-fidelity needs   |
| wav    | Uncompressed audio                    | Audio processing and editing tasks         |

### Performance Recommendations

- **Lowest latency:** Using the `pcm` output format for streaming applications is **HIGHLY RECOMMENDED**.
- **General use:** Use `mp3` or `opus` formats for the best trade-off between quality and file size.
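Because `pcm` is raw audio without a header, most players cannot open it directly. The sketch below wraps raw PCM bytes in a WAV container using Python's standard `wave` module. The sample rate, 16-bit sample width, and mono channel count are assumptions chosen for illustration; this document does not state the PCM parameters the API uses, so check them before relying on this.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container (parameters are caller-supplied assumptions)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Demo: one second of 16-bit mono silence at an assumed 24 kHz sample rate.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence, sample_rate=24000)
```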



Use the following JavaScript examples to generate an audio file from your text input:

## Create Speech (non-streaming)

```javascript
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN", base_url: "https://api.deepinfra.com/" });
await client.textToSpeech.convert("luna", {
    output_format: "mp3",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "nari-labs/Dia-1.6B"
});
```

## Create Speech with Streaming

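```javascript
import { ElevenLabsClient } from "elevenlabs";

// Point the client at DeepInfra, as in the non-streaming example.
const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN", base_url: "https://api.deepinfra.com/" });
// Use the streaming variant of the call, mirroring convert_as_stream in the Python SDK.
await client.textToSpeech.convertAsStream("luna", {
    output_format: "pcm",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "nari-labs/Dia-1.6B"
});
```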




Use the following `curl` requests to generate an audio file from your text input:

## Create Speech (non-streaming)

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/luna" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "nari-labs/Dia-1.6B",
  "output_format": "mp3"
}' --output speech.mp3
```

## Create Speech with Streaming

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/luna/stream" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "nari-labs/Dia-1.6B",
  "output_format": "pcm"
}' --output speech.pcm
```




## Input fields

Input schema: `DiaTextToSpeechIn` (model `nari-labs/Dia-1.6B`, HTTP/cURL API).

| Field                | Type    | Description                                                                                                              |
|----------------------|---------|--------------------------------------------------------------------------------------------------------------------------|
| `input`              | string  | Input text                                                                                                               |
| `speaker_audio`      | string  | Audio prompt for the speech to be synthesized                                                                            |
| `speaker_transcript` | string  | Transcript of the given speaker audio. If not provided then the speaker audio will be used as is.                        |
| `max_new_tokens`     | integer | Controls the maximum length of the generated audio (more tokens = longer audio).                                         |
| `cfg_scale`          | integer | Higher values increase adherence to the text prompt.                                                                     |
| `temperature`        | number  | Lower values make the output more deterministic, higher values increase randomness.                                      |
| `top_p`              | number  | Filters vocabulary to the most likely tokens cumulatively reaching probability P.                                        |
| `cfg_filter_top_k`   | integer | (no description given)                                                                                                   |
| `speed`              | number  | Adjusts the speed of the generated audio (1.0 = original speed).                                                         |
| `webhook`            | file    | The webhook to call when inference is done. By default you will get the output in the response of your inference request. |

## Output fields

Output schema: `TextToSpeechOut`.

- `audio`
- `input_character_length`
- `output_format`
- `words`: list of `Word` entries, each with `text`, `start`, `end`, and `confidence`
- `request_id`
- `inference_status`: object containing the status of the inference request (`InferenceReplyStatus`)
  - `status`
  - `runtime_ms`
  - `cost`: estimated cost billed for the request in USD
  - `tokens_generated`
  - `tokens_input`

## OpenAI-compatible request parameters

- `model`
- `voice`
- `response_format`: Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
- `extra_body`
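The input fields listed above can be collected in a small typed container before sending a request. This is a sketch under stated assumptions: `DiaInput` is a hypothetical class, every optional field defaults to `None` because this document gives no default values, and `webhook` is modelled as a plain string even though the schema lists its type as `file`.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DiaInput:
    """Input fields for nari-labs/Dia-1.6B; None means the field is not sent."""
    input: str
    speaker_audio: Optional[str] = None
    speaker_transcript: Optional[str] = None
    max_new_tokens: Optional[int] = None
    cfg_scale: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    cfg_filter_top_k: Optional[int] = None
    speed: Optional[float] = None
    webhook: Optional[str] = None  # assumption: treated as a URL string here

    def to_fields(self) -> dict:
        """Return only the fields that were explicitly set."""
        return {k: v for k, v in asdict(self).items() if v is not None}


fields = DiaInput(
    input="[S1] Dia is an open weights text to dialogue model.",
    temperature=0.8,
    speed=1.0,
).to_fields()
print(fields)
```

Dropping unset fields leaves the choice of defaults to the server instead of guessing them on the client.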