We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Langchain improvements: async and streaming
Published on 2023.10.25 by Iskren Chernev
Langchain improvements: async and streaming

Starting from langchain v0.0.322 you can make efficient async generation and streaming tokens with deepinfra.

Async generation

The deepinfra wrapper now supports native async calls, so you can expect more performance (no more threads per invocation) from your async pipelines.

from langchain.llms.deepinfra import DeepInfra

async def async_predict():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    output = await llm.apredict("What is 2 + 2?")
    print(output)
copy

Response streaming

Streaming lets you receive each token of the response as it gets generated. This is indispensable in user-facing applications.

def streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    for chunk in llm.stream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
copy

You can also use the asynchronous streaming API, natively implemented underneath.

async def async_streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    async for chunk in llm.astream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
copy
Related articles
OpenClaw Security: Prevent Prompt Injection & Supply Chain AttacksOpenClaw Security: Prevent Prompt Injection & Supply Chain Attacks<p>In early 2026, the China’s Ministry of Industry and Information Technology issued an emergency warning about an AI agent runtime that had quietly grown to 135,000 GitHub stars. By mid-February, security researchers were tracking a coordinated campaign called ClawHavoc. The Moltbook breach had exposed customer email archives from 41 enterprises. OpenClaw&#8217;s maintainers had shipped three [&hellip;]</p>
NVIDIA Nemotron 3 Super: Model Overview & Integration GuideNVIDIA Nemotron 3 Super: Model Overview & Integration Guide<p>The NVIDIA Nemotron 3 Super is a state-of-the-art 120-billion parameter hybrid Mixture-of-Experts (MoE) model designed to bridge the gap between high-compute efficiency and extreme accuracy. Engineered specifically for the next generation of AI development, Nemotron 3 Super excels in multi-agent applications, specialized agentic systems, and complex reasoning tasks. By utilizing a sophisticated architecture that activates [&hellip;]</p>
Introducing NVIDIA Nemotron 3 Nano Omni on DeepInfraIntroducing NVIDIA Nemotron 3 Nano Omni on DeepInfraDeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano Omni, the first multimodal model in the Nemotron 3 family — a single open model that understands images, video, audio, documents, and text in one unified inference pass.