
openai/whisper-large-v3-turbo

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3: the same model, except that the number of decoder layers has been reduced from 32 to 4. As a result, the model is significantly faster, at the expense of a minor degradation in quality.

Public
$0.00020 / minute
Paper · License

HTTP/cURL API

You can use cURL or any other HTTP client to run inferences:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F audio=@my_voice.mp3  \
    'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo'
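The same request can be made from Python. A minimal sketch using the third-party `requests` library, assuming the endpoint accepts the audio file and optional parameters as multipart form fields (as the cURL example above suggests):

```python
import requests

# Endpoint from the cURL example above.
API_URL = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"


def transcribe(audio_path: str, token: str, task: str = "transcribe") -> dict:
    """Upload an audio file for transcription and return the parsed JSON response."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"bearer {token}"},
            files={"audio": f},
            data={"task": task},
        )
    resp.raise_for_status()
    return resp.json()


# Example usage (requires a real token and audio file):
# result = transcribe("my_voice.mp3", os.environ["DEEPINFRA_TOKEN"])
# print(result["text"])
```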

which will give you back something similar to:

{
  "text": "",
  "segments": [
    {
      "end": 1.0,
      "id": 0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "id": 1,
      "start": 4.0,
      "text": "World"
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "duration": 0.0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

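The `segments` array in the response above is convenient for generating subtitles. A small sketch that converts it into SubRip (SRT) cues, using the example segments from the sample response:

```python
def fmt_ts(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render the response's "segments" array as SRT cue blocks."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


# Segments taken from the sample response above.
example_segments = [
    {"id": 0, "start": 0.0, "end": 1.0, "text": "Hello"},
    {"id": 1, "start": 4.0, "end": 5.0, "text": "World"},
]
srt_text = segments_to_srt(example_segments)
print(srt_text)
```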

Input fields

audio (string)

audio to transcribe


task (string)

task to perform

Default value: "transcribe"

Allowed values: transcribe, translate


initial_prompt (string)

optional text to provide as a prompt for the first window.


temperature (number)

temperature to use for sampling

Default value: 0


language (string)

language that the audio is in; the detected language is used if omitted; use a two-letter language code (ISO 639-1), e.g. en, de, ja

Allowed values: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh


chunk_level (string)

chunk level, either 'segment' or 'word'

Default value: "segment"

Allowed values: segmentword


chunk_length_s (integer)

chunk length in seconds to split audio

Default value: 30

Range: 1 ≤ chunk_length_s ≤ 30


webhook (file)

The webhook to call when inference is done; by default you will get the output in the response of your inference request
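The optional fields above can be validated client-side before sending. A sketch of a helper that assembles them as multipart form fields, assuming (as the cURL example suggests) that each documented field name maps one-to-one onto a form field; `build_form_fields` is a hypothetical helper, not part of any SDK:

```python
# Allowed values and ranges taken from the field descriptions above.
ALLOWED_TASKS = {"transcribe", "translate"}
ALLOWED_CHUNK_LEVELS = {"segment", "word"}


def build_form_fields(task="transcribe", language=None, chunk_level="segment",
                      chunk_length_s=30, initial_prompt=None, temperature=0):
    """Validate optional inference parameters and return them as form fields."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {ALLOWED_TASKS}")
    if chunk_level not in ALLOWED_CHUNK_LEVELS:
        raise ValueError(f"chunk_level must be one of {ALLOWED_CHUNK_LEVELS}")
    if not 1 <= chunk_length_s <= 30:
        raise ValueError("chunk_length_s must be between 1 and 30")
    fields = {
        "task": task,
        "chunk_level": chunk_level,
        "chunk_length_s": str(chunk_length_s),
        "temperature": str(temperature),
    }
    # Optional fields are omitted entirely when unset, so the API defaults apply.
    if language is not None:
        fields["language"] = language
    if initial_prompt is not None:
        fields["initial_prompt"] = initial_prompt
    return fields
```

These fields would then be passed alongside the audio file upload (e.g. as extra `-F` flags in the cURL example above).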

