We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

GLM-5.1 - state-of-the-art agentic engineering, now available on DeepInfra!

Art That Talks Back: A Hands-On Tutorial on Talking Images
Published on 2025.03.07 by Oguz Vuruskaner
Art That Talks Back: A Hands-On Tutorial on Talking Images

Imagine going to an art gallery where paintings tell their stories. That’s what "Talking Images" do in practice. This tutorial shows you how to make art speak using DeepInfra models. We are going to use:

1-) deepseek-ai/Janus-Pro-7B
2-) hexgrad/Kokoro-82M

Setting Up Environment

First, let’s set up your environment. You’ll need these packages. Here’s the content of requirements.txt:

gradio
requests
python-dotenv
pillow
scipy
numpy
copy

Venv Environment Setup

Show Venv Tutorial

python -m venv venv && (venv\Scripts\activate.bat 2>nul || source venv/bin/activate) && pip install -r requirements.txt
copy

Create .env File

Next, create a .env file in your project folder. Copy your DEEPINFRA_API_TOKEN into it. Your .env file should look like this:

DEEPINFRA_API_TOKEN=your-api-token-here
copy

Replace your-api-token-here with your actual DeepInfra API token.

The Code

Here’s the Python code that makes your images talk. It uses Janus-Pro-7B to describe the image and Kokoro-82M to turn that description into audio.

import os
from io import BytesIO
import gradio as gr
import base64
import requests
from dotenv import load_dotenv, find_dotenv
from scipy.io import wavfile
import numpy as np

_ = load_dotenv(find_dotenv())

def analyze_image(image) -> str:
    url = "https://api.deepinfra.com/v1/inference/deepseek-ai/Janus-Pro-7B"
    headers = {"Authorization": f"bearer {api_token}"}
    buffered = BytesIO()
    if image.mode == "RGBA":
        image = image.convert("RGB")
    format = "JPEG" if image.format == "JPEG" else "PNG"
    image.save(buffered, format=format)
    files = {"image": ("my_image." + format.lower(), buffered.getvalue(), f"image/{format.lower()}")}
    data = {
        "question": "I am this image. You must describe me in my own voice using 'I'. State my colors, shapes, mood, and any notable features with precise detail. Examples: 'I have clouds,' 'I contain sharp lines.' Be vivid, thorough, and factual."
    }
    response = requests.post(url, headers=headers, files=files, data=data)
    return response.json()["response"]

def text_to_speech(text: str) -> tuple:
    url = "https://api.deepinfra.com/v1/inference/hexgrad/Kokoro-82M"
    headers = {
        "Authorization": f"bearer {api_token}",
        "Content-Type": "application/json"
    }
    data = {
        "text": text
    }
    response = requests.post(url, json=data, headers=headers)
    res_json = response.json()
    audio_base64 = res_json["audio"].split(",")[1]
    audio_bytes = base64.b64decode(audio_base64)
    audio_io = BytesIO(audio_bytes)
    sample_rate, audio_data = wavfile.read(audio_io)
    return sample_rate, audio_data

def make_image_talk(image):
    description = analyze_image(image)
    sample_rate, audio_data = text_to_speech(description)
    return sample_rate, audio_data

if __name__ == "__main__":
    api_token = os.environ.get("DEEPINFRA_API_TOKEN")
    interface = gr.Interface(
        fn=make_image_talk,
        inputs=gr.Image(type="pil"),
        outputs=gr.Audio(type="numpy"),
        title="Art That Talks Back",
        description="Upload an image and hear it talk!"
    )
    interface.launch()
copy

Final Look

app.jpg

Try It Yourself!

Ready to hear your own art talk back? Grab yourself an image, run the code, and upload it. Do not forget to follow us on Linkedin and on X.

Related articles
Qwen3.5 4B via DeepInfra: Latency, Throughput & CostQwen3.5 4B via DeepInfra: Latency, Throughput & Cost<p>About Qwen3.5 4B (Reasoning) Qwen3.5 4B is a compact 4-billion parameter open-weights model released in March 2026 as part of Alibaba Cloud&#8217;s Qwen3.5 Small Model Series. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts, delivering high-throughput inference with minimal latency overhead — a significant architectural [&hellip;]</p>
NVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & CostNVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & Cost<p>About NVIDIA Nemotron 3 Nano 30B A3B NVIDIA Nemotron 3 Nano 30B A3B is a large language model trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks. It is part of the Nemotron 3 family — NVIDIA&#8217;s most efficient family of open models, built for agentic AI applications. [&hellip;]</p>
Nemotron 3 Nano vs GPT-OSS-20B: Performance, Benchmarks & DeepInfra ResultsNemotron 3 Nano vs GPT-OSS-20B: Performance, Benchmarks & DeepInfra Results<p>The open-source LLM landscape is becoming increasingly diverse, with models optimized for reasoning, throughput, cost-efficiency, and real-world agentic applications. Two models that stand out in this new generation are NVIDIA’s Nemotron 3 Nano and OpenAI’s GPT-OSS-20B, both of which offer strong performance while remaining openly available and deployable across cloud and edge systems. Although both [&hellip;]</p>