
Imagine going to an art gallery where paintings tell their stories. That’s what "Talking Images" do in practice. This tutorial shows you how to make art speak using DeepInfra models. We are going to use:
1-) deepseek-ai/Janus-Pro-7B
2-) hexgrad/Kokoro-82M
First, let’s set up your environment. You’ll need these packages. Here’s the content of requirements.txt:
gradio
requests
python-dotenv
pillow
scipy
numpy
Create and activate a virtual environment, then install the dependencies. On macOS/Linux:
python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
On Windows (cmd):
python -m venv venv && venv\Scripts\activate.bat && pip install -r requirements.txt
Next, create a .env file in your project folder. Copy your DEEPINFRA_API_TOKEN into it. Your .env file should look like this:
DEEPINFRA_API_TOKEN=your-api-token-here
Replace your-api-token-here with your actual DeepInfra API token.
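The script below uses python-dotenv to read this file, but the .env format is simple enough to sketch a stdlib-only loader if you want to see what happens under the hood (load_env and its path argument are illustrative names here, not part of any library):

```python
import os
import pathlib

def load_env(path=".env"):
    # Minimal .env loader sketch (stdlib only); python-dotenv is more robust.
    for line in pathlib.Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            # setdefault: a variable already set in the shell wins over the file.
            os.environ.setdefault(key.strip(), value.strip())

pathlib.Path(".env").write_text("DEEPINFRA_API_TOKEN=your-api-token-here\n")
load_env()
```

In the tutorial itself we stick with python-dotenv, which also handles quoting, export prefixes, and variable expansion.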
Here’s the Python code that makes your images talk. It uses Janus-Pro-7B to describe the image and Kokoro-82M to turn that description into audio.
import os
import base64
from io import BytesIO

import gradio as gr
import requests
from dotenv import load_dotenv, find_dotenv
from scipy.io import wavfile

_ = load_dotenv(find_dotenv())
api_token = os.environ.get("DEEPINFRA_API_TOKEN")


def analyze_image(image) -> str:
    """Ask Janus-Pro-7B to describe the image in the first person."""
    url = "https://api.deepinfra.com/v1/inference/deepseek-ai/Janus-Pro-7B"
    headers = {"Authorization": f"Bearer {api_token}"}
    buffered = BytesIO()
    if image.mode == "RGBA":
        image = image.convert("RGB")
    image_format = "JPEG" if image.format == "JPEG" else "PNG"
    image.save(buffered, format=image_format)
    files = {
        "image": (
            "my_image." + image_format.lower(),
            buffered.getvalue(),
            f"image/{image_format.lower()}",
        )
    }
    data = {
        "question": (
            "I am this image. You must describe me in my own voice using 'I'. "
            "State my colors, shapes, mood, and any notable features with "
            "precise detail. Examples: 'I have clouds,' 'I contain sharp "
            "lines.' Be vivid, thorough, and factual."
        )
    }
    response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()
    return response.json()["response"]


def text_to_speech(text: str) -> tuple:
    """Turn the description into audio with Kokoro-82M."""
    url = "https://api.deepinfra.com/v1/inference/hexgrad/Kokoro-82M"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
    response = requests.post(url, json={"text": text}, headers=headers)
    response.raise_for_status()
    # The audio arrives as a data URI; keep only the base64 payload.
    audio_base64 = response.json()["audio"].split(",", 1)[1]
    audio_bytes = base64.b64decode(audio_base64)
    sample_rate, audio_data = wavfile.read(BytesIO(audio_bytes))
    return sample_rate, audio_data


def make_image_talk(image):
    description = analyze_image(image)
    sample_rate, audio_data = text_to_speech(description)
    return sample_rate, audio_data


if __name__ == "__main__":
    interface = gr.Interface(
        fn=make_image_talk,
        inputs=gr.Image(type="pil"),
        outputs=gr.Audio(type="numpy"),
        title="Art That Talks Back",
        description="Upload an image and hear it talk!",
    )
    interface.launch()
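One detail worth calling out: Kokoro's JSON response carries the WAV file as a data URI (something like data:audio/wav;base64,...), which is why text_to_speech splits on the comma before base64-decoding. That step can be exercised in isolation with a stand-in payload (the fake_wav bytes below are just a placeholder shaped like a WAV header, not real model output):

```python
import base64

# Stand-in payload shaped like the "audio" field in Kokoro's response.
fake_wav = b"RIFF\x00\x00\x00\x00WAVE"
data_uri = "data:audio/wav;base64," + base64.b64encode(fake_wav).decode()

audio_base64 = data_uri.split(",", 1)[1]   # drop the "data:audio/wav;base64," prefix
audio_bytes = base64.b64decode(audio_base64)
print(audio_bytes[:4])  # → b'RIFF'
```

The same decoded bytes are what wavfile.read unpacks into a sample rate and a NumPy array for Gradio's audio component.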
Ready to hear your own art talk back? Grab an image, run the code, and upload it. Don't forget to follow us on LinkedIn and on X.