We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

NVIDIA Nemotron 3 Super - blazing-fast agentic AI, ready to deploy today!

Langchain improvements: async and streaming
Published on 2023.10.25 by Iskren Chernev
Langchain improvements: async and streaming

Starting from langchain v0.0.322 you can make efficient async generation and streaming tokens with deepinfra.

Async generation

The deepinfra wrapper now supports native async calls, so you can expect more performance (no more threads per invocation) from your async pipelines.

from langchain.llms.deepinfra import DeepInfra

async def async_predict():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    output = await llm.apredict("What is 2 + 2?")
    print(output)
copy

Response streaming

Streaming lets you receive each token of the response as it gets generated. This is indispensable in user-facing applications.

def streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    for chunk in llm.stream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
copy

You can also use the asynchronous streaming API, natively implemented underneath.

async def async_streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    async for chunk in llm.astream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
copy
Related articles
Nemotron 3 Nano vs GPT-OSS-20B: Performance, Benchmarks & DeepInfra ResultsNemotron 3 Nano vs GPT-OSS-20B: Performance, Benchmarks & DeepInfra Results<p>The open-source LLM landscape is becoming increasingly diverse, with models optimized for reasoning, throughput, cost-efficiency, and real-world agentic applications. Two models that stand out in this new generation are NVIDIA’s Nemotron 3 Nano and OpenAI’s GPT-OSS-20B, both of which offer strong performance while remaining openly available and deployable across cloud and edge systems. Although both [&hellip;]</p>
Seed Anchoring and Parameter Tweaking with SDXL Turbo: Create Stunning Cubist ArtSeed Anchoring and Parameter Tweaking with SDXL Turbo: Create Stunning Cubist ArtIn this blog post, we're going to explore how to create stunning cubist art using SDXL Turbo using some advanced image generation techniques.
Search That Actually Works: A Guide to LLM RerankersSearch That Actually Works: A Guide to LLM RerankersSearch relevance isn’t a nice-to-have feature for your site or app. It can make or break the entire user experience. When a customer searches "best laptop for video editing" and gets results for gaming laptops or budget models, they leave empty-handed. Embeddings help you find similar content, bu...