
Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models.
Scoring 38 on the Artificial Analysis Intelligence Index — well above the comparable open-weights median of 27 — Step 3.5 Flash features a 256k token context window (roughly 384 A4 pages), extended chain-of-thought reasoning controllable via a reasoning_effort parameter, native tool calling with parallel function invocation, and JSON mode for structured output. The model is released under the Apache 2.0 license, enabling commercial use and third-party hosting on platforms like DeepInfra.
It’s a highly verbose model during reasoning — generating an average of 200M tokens during intelligence evaluations versus a median of 17M for comparable models — which makes cost efficiency a critical factor when selecting an inference provider.
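The `reasoning_effort` control mentioned above can be sketched as a request payload. This is a minimal sketch assuming an OpenAI-compatible endpoint; the model slug and the exact parameter name and placement are placeholders to verify against your provider's documentation.

```python
import json

# Sketch of a Step 3.5 Flash call against an OpenAI-compatible endpoint.
# The model slug and the top-level "reasoning_effort" field are assumptions;
# check your provider's current docs before relying on them.
API_BASE = "https://api.deepinfra.com/v1/openai"  # OpenAI-compatible base URL

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a chat-completion payload with a reasoning-effort hint."""
    return {
        "model": "stepfun-ai/step-3.5-flash",    # hypothetical slug
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,              # e.g. "low" | "medium" | "high"
        "max_tokens": 1024,
    }

payload = build_request("Summarize MoE inference in two sentences.")
print(json.dumps(payload, indent=2))
# Send with an HTTP POST to f"{API_BASE}/chat/completions" plus a Bearer token.
```

Dialing `reasoning_effort` down is one practical lever against the model's verbosity, since shorter chains of thought directly reduce billed output tokens.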
Step 3.5 Flash is now available across multiple API providers, but not all of them are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

| Provider | Why Notable | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Best Use Case |
|---|---|---|---|---|---|---|
| DeepInfra | Industry-leading TTFT (~0.32s) with competitive pricing; JSON mode + function calling | $0.10 | $0.30 | ~0.32s | 77–88 | Real-time applications, conversational agents |
| SiliconFlow (FP8) | Highest raw throughput at 100.4 t/s for batch workloads | ~$0.15 blended | ~$0.15 blended | 2.17s | 100.4 | High-volume generation, batch processing |
| StepFun (first-party) | Primary reference API from the model creator; high throughput baseline | $0.10 | $0.30 | 3.19s | 95.2 | Batch workloads, non-interactive applications |
| OpenRouter | API aggregator routing across providers for maximum uptime and redundancy | $0.10 | $0.30 | Varies | Varies | Enterprise uptime requirements, API routing |

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale Step 3.5 Flash deployment. It offers an industry-leading TTFT of ~0.32 seconds (nearly 10x faster than StepFun's first-party API) while matching the competitive baseline pricing of $0.10 input / $0.30 output. For maximum raw throughput, SiliconFlow leads at 100.4 t/s. For enterprise uptime requirements, OpenRouter provides routing redundancy across providers.
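The table's per-token rates translate into per-request costs as follows. The request sizes below are illustrative assumptions (not benchmark figures), chosen to reflect the model's verbose reasoning output.

```python
# Back-of-the-envelope cost per request at the table's $/1M-token rates.
# Request sizes are illustrative assumptions, not measured values.
def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Cost in USD for one request, with rates quoted per 1M tokens."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Assume 2,000 input tokens and 8,000 output tokens (reasoning is verbose).
per_request = cost_usd(2_000, 8_000, in_rate=0.10, out_rate=0.30)
print(f"per request: ${per_request:.4f}")          # output dominates the bill
print(f"per 1k reqs: ${per_request * 1_000:.2f}")
```

Note that with a verbose reasoning model, the output rate dominates total spend, which is why the $0.30/1M output price matters more than the input price for most workloads.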
DeepInfra stands out as the overall recommended provider for Step 3.5 Flash, striking the optimal balance between ultra-low latency, competitive pricing, and full feature support.
Reasoning models like Step 3.5 Flash require thinking time before outputting an answer, which inherently increases end-to-end response times. DeepInfra mitigates this with a TTFT of ~0.32 seconds — compared to the 2–3 second averages seen at other providers. Given the model’s verbose reasoning behavior, this latency advantage compounds significantly for interactive applications where users are waiting for the first token.
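TTFT is straightforward to measure yourself with a streaming response: time how long it takes for the first chunk to arrive. The sketch below uses a fake generator in place of a real provider stream so it runs standalone.

```python
import time

def measure_ttft(stream) -> float:
    """Seconds from iteration start until the first chunk arrives."""
    start = time.perf_counter()
    next(iter(stream))               # block until the first token/chunk
    return time.perf_counter() - start

def fake_stream(delay: float = 0.05):
    """Stand-in for a provider's streaming response iterator."""
    time.sleep(delay)                # simulated pre-generation thinking time
    yield "Hello"
    yield ", world"

ttft = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s")
```

Running the same measurement against each provider's streaming endpoint with an identical prompt is the simplest way to validate the latency figures in the table for your own region and workload.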
DeepInfra also matches the baseline competitive pricing of $0.10/$0.30 for input/output tokens while adding full JSON Mode and Function Calling support — making it the most cost-efficient and responsive choice for developers building real-time agentic applications.
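JSON Mode and Function Calling are both expressed through the standard OpenAI-compatible request schema. The sketch below shows both fields in one payload for brevity; the tool itself (`get_quote`) is hypothetical, and whether a given provider accepts both features in the same request is worth verifying.

```python
# Sketch of JSON mode plus a tool definition in the OpenAI-compatible schema.
# The model slug and the "get_quote" tool are hypothetical placeholders.
payload = {
    "model": "stepfun-ai/step-3.5-flash",        # hypothetical slug
    "messages": [{"role": "user", "content": "Quote 1M output tokens."}],
    "response_format": {"type": "json_object"},  # JSON mode
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_quote",
            "description": "Return a price quote for a token volume.",
            "parameters": {
                "type": "object",
                "properties": {"output_tokens": {"type": "integer"}},
                "required": ["output_tokens"],
            },
        },
    }],
    "tool_choice": "auto",                        # let the model decide
}
print(payload["tools"][0]["function"]["name"])
```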
For workloads where raw output speed is prioritized over initial response time, SiliconFlow running FP8 quantization is the leading alternative.
At 100.4 tokens/sec, SiliconFlow surpasses the Step 3.5 Flash baseline average of 82.2 t/s. For workloads involving large-scale code generation, long-context reasoning tasks, or batch document processing where the 2.17-second initial latency is acceptable, SiliconFlow provides the highest throughput available. For conversational agents requiring immediate user feedback, the higher TTFT makes it a less optimal choice than DeepInfra.
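For sequential batch jobs, the trade-off between TTFT and throughput can be estimated directly from the table. The per-request sizes below are illustrative assumptions, and DeepInfra's speed is taken as the midpoint of its 77–88 t/s range.

```python
# Estimate wall-clock time for a sequential batch at a given decode speed.
# Per-request sizes are illustrative; TTFT is paid once per request.
def batch_seconds(requests: int, output_tokens_each: int,
                  tokens_per_sec: float, ttft: float) -> float:
    return requests * (ttft + output_tokens_each / tokens_per_sec)

silicon = batch_seconds(100, 8_000, tokens_per_sec=100.4, ttft=2.17)
deepinfra = batch_seconds(100, 8_000, tokens_per_sec=82.5, ttft=0.32)
print(f"SiliconFlow: {silicon / 60:.1f} min")
print(f"DeepInfra:   {deepinfra / 60:.1f} min")
```

With long outputs, decode throughput dominates and SiliconFlow's 2.17 s TTFT amortizes away, which is exactly why it suits batch workloads while remaining a poor fit for interactive ones.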
Using the model creator’s first-party API is a standard route for enterprise developers prioritizing reliability and direct vendor support.
The StepFun API offers solid throughput at 95.2 t/s and competitive pricing that matches DeepInfra. The primary drawback is latency: a TTFT of roughly 3.2 seconds means end-users will experience a noticeable delay before the model begins generating. For batch workloads or non-interactive applications, StepFun remains a solid choice as the authoritative first-party provider. For interactive applications, DeepInfra's 10x latency advantage is decisive.
For enterprise applications with strict uptime requirements, OpenRouter serves as a routing layer rather than a standalone inference host.
OpenRouter does not host Step 3.5 Flash directly but routes API requests to the best available providers — including DeepInfra and StepFun — to maintain operational redundancy. It passes through the standard $0.10/$0.30 pricing structure while natively supporting the model’s full context window. For production environments where API redundancy is a strict requirement, OpenRouter is a practical choice.
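OpenRouter exposes routing preferences through an extension of the OpenAI schema. The sketch below pins a preferred provider order; the model slug and the exact `provider` field names are assumptions worth checking against current OpenRouter documentation.

```python
# Sketch of an OpenRouter request that pins a preferred provider order.
# Slug and "provider" routing fields are assumptions; verify against docs.
payload = {
    "model": "stepfun/step-3.5-flash",        # hypothetical OpenRouter slug
    "messages": [{"role": "user", "content": "ping"}],
    "provider": {
        "order": ["DeepInfra", "StepFun"],    # try DeepInfra first
        "allow_fallbacks": True,              # fail over if unavailable
    },
}
print(payload["provider"]["order"])
```

This pattern gives you DeepInfra's latency profile in the common case while keeping StepFun as an automatic fallback.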
Step 3.5 Flash features a 256k token context window, equivalent to processing approximately 384 standard A4 pages of text in a single prompt.
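The "~384 A4 pages" figure follows from a rough token-to-page conversion, assuming about 0.75 words per token and about 500 words per A4 page:

```python
# Rough conversion from context-window tokens to A4 pages.
# Assumes ~0.75 words per token and ~500 words per page (both approximations).
CONTEXT_TOKENS = 256_000
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

words = CONTEXT_TOKENS * WORDS_PER_TOKEN   # 192,000 words
pages = words / WORDS_PER_PAGE             # 384 pages
print(f"~{pages:.0f} A4 pages")
```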
While StepFun is the model creator, DeepInfra offers a significantly lower TTFT (~0.32 seconds vs. StepFun’s 3.19 seconds) at the same price point, making it far better suited for real-time and conversational applications. DeepInfra also supports both JSON Mode and Function Calling.
Step 3.5 Flash is a text-only model, supporting text input and text output only. It does not support image input or other multimodal capabilities.
Step 3.5 Flash is released under the Apache 2.0 license, which permits commercial use and enables third-party hosting on platforms like DeepInfra.
Step 3.5 Flash uses a Mixture of Experts (MoE) architecture with 196 billion total parameters and approximately 11 billion active parameters per token during inference.
Step 3.5 Flash is a highly capable open-weights reasoning model that competes aggressively on both intelligence metrics and operational cost. Scoring 38 on the Artificial Analysis Intelligence Index — well above the open-weights median of 27 — it delivers enterprise-grade reasoning at a fraction of the cost of comparable closed-source models.
For the vast majority of Step 3.5 Flash deployments, DeepInfra is the clear overall recommendation. Its unmatched TTFT of ~0.32 seconds combined with competitive pricing ($0.10 input / $0.30 output) and full JSON Mode and Function Calling support makes it the optimal infrastructure for real-time agentic applications.
© 2026 Deep Infra. All rights reserved.