StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

Sen Fang, Hongbin Zhong, Yalin Feng, Dimitris N. Metaxas

TL;DR

Rectified Flow models suffer from slow inference due to their dynamic time-step structures. StreamFlow introduces a cohesive acceleration framework with batched velocity-field processing, vectorized time windows, and runtime-adaptive TensorRT compilation to handle heterogeneous timesteps. The approach achieves up to 611% speedup on 512×512 images and maintains high generation quality, with robust scalability to larger resolutions. This work enables practical deployment of large-scale flow-based generative models by delivering substantial throughput gains without sacrificing fidelity.

Abstract

New techniques such as Rectified Flow and Flow Matching have significantly improved generative models over the past two years, especially in control accuracy, generation quality, and generation efficiency. However, because Rectified Flow differs from existing diffusion models in both theory and design, existing acceleration methods cannot be applied to it directly. In this article, we implement a complete acceleration pipeline that covers theory, design, and inference strategy. The pipeline combines batch processing built on a new velocity field, vectorized batch processing of heterogeneous time steps, and dynamic TensorRT compilation adapted to these methods, in order to accelerate flow-based models. Existing public methods usually achieve an acceleration of 18%, while our experiments show that the new method can raise 512×512 image generation speed to as much as 611% of the original speed, far beyond current non-generalized acceleration methods.

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, 2 tables, 3 algorithms.

Figures (5)

  • Figure 1: Differences between original denoising and batch denoising. Each line can be read as a separate parallel operation queue. (a) The original diffusion generation process: if generation takes 5 steps, the denoising steps run sequentially. (b) Common acceleration methods use multiple parallel operation queues, with each queue handling only a particular step. The grey area marks the complete process of a single generation.
  • Figure 2: Overview of our StreamFlow pipeline. (Pass 1) The process starts with prompt processing via the Rectified Flow Backbone and partial prefilling to establish the continuous velocity field $v_t(X_t)$ based on the trajectory equation $dX_t = v_t(X_t)dt$. (Pass 2) The Decoupled Time-Step Binding mechanism manages parallel queues with heterogeneous step sizes (e.g., $\Delta t_a, \Delta t_b$), applying velocity updates independently before converging via in-place aggregation for arbitrary batching. (Pass 3) Heterogeneous time steps $[t_i]$ are vectorized and managed by the Unified Adaptive Scheduler, which handles varying temporal gradients and step sizes to synchronize velocity outputs for subsequent partial prefilling. (Pass 4) The model undergoes Adaptive Model Construction Compilation, which uses a Dynamic TensorRT Compiler for plug-and-play compatibility and structural optimization to create an Optimized Execution Engine.
  • Figure 3: Detailed implementation of StreamFlow's batched velocity field processing and heterogeneous timestep pipeline. Starting from prompt and reference image inputs, Pass 1 performs TensorRT-optimized/VFB (Velocity Field Batching) prefilling to establish initial latent representations. Pass 2 demonstrates the core velocity field batch processing: multiple RF model instances process parallel queues (Q0, Q1, Q2) with heterogeneous timesteps, where each queue independently computes velocity fields before synchronization through reranking modules. Pass 3 shows the asynchronous queue management: the unified adaptive scheduler coordinates multiple parallel streams, each progressing through partial prefilling → full prefilling → VAE decoding stages at different rates. Pass 4 illustrates the dynamic dependency graph that enables proper data flow across heterogeneous processing stages. Data dependency arrows show how velocity field outputs from Pass 2 feed into the scheduler, which then manages the asynchronous progression of multiple generation streams in Pass 3, achieving continuous throughput without blocking.
  • Figure 4: Ablation Study. We conducted a detailed ablation study on all components. The experiments show that every component yields a significantly higher average generation speed than the original model or the official Hugging Face pipeline. In the fastest case, the peak improvement reaches a ratio of 11/1.8 ≈ 6.11, i.e., roughly 611% of the original speed.
  • Figure 5: Scalability Study. By varying the target size of the generated images, we found that our method is much more robust than previous methods: its speedup is hardly affected by large variations in size. Even at relatively large sizes, we still achieve an acceleration of almost four to five times over the baseline. This demonstrates the robustness of our method and indicates that the acceleration comes from the method and strategy themselves rather than from any expedient shortcut.
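
The idea described in the Figure 2 caption (parallel queues with different step sizes sharing a single batched velocity evaluation while following $dX_t = v_t(X_t)dt$) can be illustrated with a small toy sketch. This is not the authors' implementation: `Queue`, `toy_velocity`, `batched_euler_step`, and `run` are hypothetical names, the latent state is a single scalar, and the learned velocity field is replaced by an analytic stand-in whose trajectories are straight lines ending at `target` when $t = 1$.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Queue:
    """One generation stream with its own step size (hypothetical container)."""
    x: float          # current latent state X_t (a scalar, for simplicity)
    num_steps: int    # total Euler steps this queue takes from t=0 to t=1
    k: int = 0        # steps taken so far

    @property
    def t(self) -> float:
        return self.k / self.num_steps

    @property
    def dt(self) -> float:
        return 1.0 / self.num_steps


def toy_velocity(xs: List[float], ts: List[float], target: float = 1.0) -> List[float]:
    """Stand-in for the learned v_t(X_t), evaluated once for a whole batch.

    Each velocity points along the straight line from the current state to
    `target`, scaled so that the trajectory arrives there at t = 1.
    """
    return [(target - x) / (1.0 - t) for x, t in zip(xs, ts)]


def batched_euler_step(queues: List[Queue], velocity: Callable) -> None:
    """Advance every unfinished queue using ONE batched velocity call."""
    active = [q for q in queues if q.k < q.num_steps]
    if not active:
        return
    # Heterogeneous timesteps [t_i] go into the same call as a batch.
    vs = velocity([q.x for q in active], [q.t for q in active])
    for q, v in zip(active, vs):
        q.x += q.dt * v   # Euler update: X_{t+dt} = X_t + dt * v_t(X_t)
        q.k += 1


def run(queues: List[Queue], velocity: Callable) -> int:
    """Run all queues to t = 1 and return the number of batched calls made."""
    calls = 0
    while any(q.k < q.num_steps for q in queues):
        batched_euler_step(queues, velocity)
        calls += 1
    return calls


if __name__ == "__main__":
    streams = [Queue(x=-1.0, num_steps=2), Queue(x=0.0, num_steps=4), Queue(x=2.0, num_steps=8)]
    n_calls = run(streams, toy_velocity)
    print("batched velocity calls:", n_calls)
    print("final states:", [round(q.x, 6) for q in streams])
```

In this sketch the number of velocity calls equals the longest queue's step count rather than the sum over all queues, which is the kind of saving that batching heterogeneous timesteps is meant to provide. A real pipeline would use tensor operations on GPU latents instead of Python lists.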