Imagine walking through an art gallery where the paintings tell their own stories. That’s the idea behind "Talking Images". This tutorial shows you how to make art speak using two models hosted on DeepInfra:
1. deepseek-ai/Janus-Pro-7B, which describes the image
2. hexgrad/Kokoro-82M, which turns that description into audio
First, let’s set up your environment. You’ll need these packages. Here’s the content of requirements.txt:
gradio
requests
python-dotenv
pillow
scipy
numpy
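Create a virtual environment, activate it, and install the packages. On macOS or Linux:
python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
On Windows (Command Prompt):
python -m venv venv && venv\Scripts\activate.bat && pip install -r requirements.txt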
Next, create a .env file in your project folder. Copy your DEEPINFRA_API_TOKEN into it. Your .env file should look like this:
DEEPINFRA_API_TOKEN=your-api-token-here
Replace your-api-token-here with your actual DeepInfra API token.
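A missing or misnamed token usually only shows up later as an authentication error from the API, so it can help to check for it at startup. The following is a minimal sketch rather than part of the original script: the helper name require_token is hypothetical, and it reads from any mapping so you can try it without touching your real environment.

```python
import os


def require_token(env=os.environ, name="DEEPINFRA_API_TOKEN"):
    """Return the API token found in `env`, or raise a clear error.

    `require_token` is a hypothetical helper used for illustration only.
    """
    token = (env.get(name) or "").strip()
    # Treat the placeholder from the example .env file as "not set".
    if not token or token == "your-api-token-here":
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return token
```

If you adopt something like this, you would call it once after load_dotenv has run and use the returned value wherever the script needs the token.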
Here’s the Python code that makes your images talk. It uses Janus-Pro-7B to describe the image and Kokoro-82M to turn that description into audio.
import base64
import os
from io import BytesIO

import gradio as gr
import requests
from dotenv import find_dotenv, load_dotenv
from scipy.io import wavfile

# Load DEEPINFRA_API_TOKEN from the .env file into the environment.
load_dotenv(find_dotenv())
api_token = os.environ.get("DEEPINFRA_API_TOKEN")


def analyze_image(image) -> str:
    """Ask Janus-Pro-7B to describe the image in the first person."""
    url = "https://api.deepinfra.com/v1/inference/deepseek-ai/Janus-Pro-7B"
    headers = {"Authorization": f"bearer {api_token}"}

    # Drop the alpha channel, then encode the image in memory.
    if image.mode == "RGBA":
        image = image.convert("RGB")
    image_format = "JPEG" if image.format == "JPEG" else "PNG"
    buffered = BytesIO()
    image.save(buffered, format=image_format)

    extension = image_format.lower()
    files = {"image": (f"my_image.{extension}", buffered.getvalue(), f"image/{extension}")}
    data = {
        "question": "I am this image. You must describe me in my own voice using 'I'. State my colors, shapes, mood, and any notable features with precise detail. Examples: 'I have clouds,' 'I contain sharp lines.' Be vivid, thorough, and factual."
    }

    response = requests.post(url, headers=headers, files=files, data=data)
    response.raise_for_status()  # Fail loudly on HTTP errors such as a bad token.
    return response.json()["response"]


def text_to_speech(text: str) -> tuple:
    """Convert text to speech with Kokoro-82M and return (sample_rate, samples)."""
    url = "https://api.deepinfra.com/v1/inference/hexgrad/Kokoro-82M"
    headers = {
        "Authorization": f"bearer {api_token}",
        "Content-Type": "application/json",
    }
    data = {"text": text}

    response = requests.post(url, json=data, headers=headers)
    response.raise_for_status()
    res_json = response.json()

    # The audio field is a data URI; the base64 payload follows the comma.
    audio_base64 = res_json["audio"].split(",")[1]
    audio_bytes = base64.b64decode(audio_base64)
    sample_rate, audio_data = wavfile.read(BytesIO(audio_bytes))
    return sample_rate, audio_data


def make_image_talk(image):
    """Describe the image, then speak the description."""
    description = analyze_image(image)
    return text_to_speech(description)


if __name__ == "__main__":
    interface = gr.Interface(
        fn=make_image_talk,
        inputs=gr.Image(type="pil"),
        outputs=gr.Audio(type="numpy"),
        title="Art That Talks Back",
        description="Upload an image and hear it talk!",
    )
    interface.launch()
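The text_to_speech function splits the audio field on a comma because it treats the value as a data URI: a short header followed by a comma and the base64 payload. The sketch below isolates that decoding step so you can try it without calling the API. It is an illustration under assumptions, not the service's documented format: the helper names decode_data_uri, make_silent_wav_uri, and wav_info are hypothetical, the data:audio/wav;base64 header and the 24000 Hz sample rate are arbitrary fixture values, and the standard-library wave module stands in for scipy so the example has no dependencies.

```python
import base64
import io
import wave


def decode_data_uri(data_uri: str) -> bytes:
    """Return the raw bytes of a base64 data URI such as 'data:audio/wav;base64,AAAA'."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    return base64.b64decode(payload)


def make_silent_wav_uri(sample_rate: int = 24000, n_frames: int = 240) -> str:
    """Build a short silent mono 16-bit WAV and wrap it as a data URI (test fixture)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return "data:audio/wav;base64," + encoded


def wav_info(wav_bytes: bytes) -> tuple:
    """Return (sample_rate, frame_count) for in-memory WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate(), wav.getnframes()


if __name__ == "__main__":
    raw = decode_data_uri(make_silent_wav_uri())
    print(wav_info(raw))
```

Checking the header before decoding gives a clearer error than an IndexError from split when a response does not contain the expected field format.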
Ready to hear your own art talk back? Grab an image, run the code, and upload it. Don’t forget to follow us on LinkedIn and X.
© 2026 DeepInfra. All rights reserved.