We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

FLUX.2 is live! High-fidelity image generation made simple.

Building a Voice Assistant with Whisper, LLM, and TTS
Published on 2024.09.20 by Askar Aitzhan
Building a Voice Assistant with Whisper, LLM, and TTS

Building a Voice Assistant with Whisper, LLM, and TTS

In this tutorial, we'll walk you through the process of creating a voice assistant using three powerful AI technologies:

  1. Whisper: For speech recognition
  2. LLM: For natural language processing and conversation
  3. TTS: For text-to-speech conversion

All the models are available on DeepInfra. But we will use OpenAI's python client to interact with LLM. And ElevenLabs' python client to interact with TTS.

Prerequisites

Before we begin, make sure you have the following installed and set up:

Create a virtual environment

python3 -m venv .venv
copy

Activate the environment

source .venv/bin/activate
copy

Install required libraries

brew install portaudio
pip install openai elevenlabs pyaudio numpy deepinfra scipy requests
copy

You'll also need to set up API key for DeepInfra.

Step 1: Speech Recognition with Whisper

First, let's use Whisper to transcribe user speech:

import pyaudio
import wave
import numpy as np
import requests
import json
import io
from scipy.io import wavfile
from openai import OpenAI
from elevenlabs import ElevenLabs, play

DEEPINFRA_API_KEY = "YOUR_DEEPINFRA_TOKEN"
WHISPER_MODEL = "distil-whisper/distil-large-v3"

def record_audio(duration=5, sample_rate=16000):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1

    p = pyaudio.PyAudio()

    print("Recording...")
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=sample_rate,
                    input=True,
                    frames_per_buffer=CHUNK)

    frames = []

    for i in range(0, int(sample_rate / CHUNK * duration)):
        data = stream.read(CHUNK)
        frames.append(data)

    print("Recording complete.")

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Convert to numpy array
    audio = np.frombuffer(b''.join(frames), dtype=np.int16)
    return audio

def transcribe_audio(audio):
    # Convert numpy array to WAV file in memory
    buffer = io.BytesIO()
    wavfile.write(buffer, 16000, audio.astype(np.int16))
    buffer.seek(0)

    # Prepare the request
    url = f'https://api.deepinfra.com/v1/inference/{WHISPER_MODEL}'
    headers = {
        "Authorization": f"bearer {DEEPINFRA_API_KEY}"
    }
    files = {
        'audio': ('audio.wav', buffer, 'audio/wav')
    }

    # Send the request
    response = requests.post(url, headers=headers, files=files)
    
    if response.status_code == 200:
        result = json.loads(response.text)
        return result['text']
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None
copy

Usage

audio = record_audio()
transcription = transcribe_audio(audio)
print(f"Transcription: {transcription}")
copy

Step 2: Conversing with LLM using OpenAI Client

Now, let's use the OpenAI client to interact with an LLM:

openai_client = OpenAI(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com/v1/openai")

MODEL_DI = "meta-llama/Meta-Llama-3.1-70B-Instruct"
def chat_with_llm(user_input):
    response = openai_client.chat.completions.create(
        model=MODEL_DI,
        messages=[{"role": "user", "content": user_input}],
        max_tokens=1000,
    )
    return response.choices[0].message.content
copy

Usage

llm_response = chat_with_llm(transcription)
print(f"LLM Response: {llm_response}")
copy

Step 3: Text-to-Speech with ElevenLabs

client = ElevenLabs(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com")

def text_to_speech(text):
    audio = client.generate(
        text=text,
        voice="luna",
        model="deepinfra/tts"
    )
    play(audio)
copy

Usage

text_to_speech(llm_response)
copy

Putting It All Together

Now, let's combine all these steps into a single voice assistant function:

def voice_assistant():
    while True:
        # Record and transcribe audio
        audio = record_audio()
        transcription = transcribe_audio(audio)
        print(f"You said: {transcription}")
        # Chat with LLM
        llm_response = chat_with_llm(transcription)
        print(f"Assistant: {llm_response}")
        # Convert response to speech
        text_to_speech(llm_response)
        # Ask if the user wants to continue
        if input("Continue? (y/n): ").lower() != 'y':
            break
copy

Run the voice assistant

voice_assistant()
copy

This voice assistant will continuously listen for user input, transcribe it, process it with an LLM, and respond with synthesized speech until the user chooses to stop.

Remember to replace YOUR_DEEPINFRA_TOKEN with your actual API key.

By leveraging the power of Whisper for speech recognition, LLM for intelligent conversation, and TTS for natural-sounding text-to-speech, you can create a sophisticated voice assistant capable of understanding and responding to a wide range of user queries.

Related articles
Power the Next Era of Image Generation with FLUX.2 Visual Intelligence on DeepInfraPower the Next Era of Image Generation with FLUX.2 Visual Intelligence on DeepInfraDeepInfra is excited to support FLUX.2 from day zero, bringing the newest visual intelligence model from Black Forest Labs to our platform at launch. We make it straightforward for developers, creators, and enterprises to run the model with high performance, transparent pricing, and an API designed for productivity.
Deploy Custom LLMs on DeepInfraDeploy Custom LLMs on DeepInfraDid you just finetune your favorite model and are wondering where to run it? Well, we have you covered. Simple API and predictable pricing. Put your model on huggingface Use a private repo, if you wish, we don't mind. Create a hf access token just for the repo for better security. Create c...
Art That Talks Back: A Hands-On Tutorial on Talking ImagesArt That Talks Back: A Hands-On Tutorial on Talking ImagesTurn any image into a talking masterpiece with this step-by-step guide using DeepInfra’s GenAI models.