
EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow

Dogyun Park, Yanyu Li, Sergey Tulyakov, Anil Kag

Abstract

Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the quadratic complexity of attention at every step, and the large number of iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles both bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block that is efficient, expressive, and highly stable under aggressive random token dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe: Path-Drop Guided training replaces the expensive guidance target with a computationally cheap, weak path, and a Mean-Velocity Additivity regularizer preserves fidelity at extremely low step counts. Together, these components make EFlow a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput than standard solution-flow training and 45.3x lower inference latency than standard iterative models, with competitive performance on Kinetics and large-scale text-to-video datasets.
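The few-step sampler implied by the solution-flow objective is simply repeated application of the learned map: each network call jumps the noised state from one schedule time directly to the next. Below is a minimal, hypothetical sketch of such a 4-step sampler; the schedule values, the `toy_g` stand-in for a trained network, and the tensor shapes are illustrative assumptions, not the paper's actual configuration.

```python
import torch

@torch.no_grad()
def few_step_sample(g, x_T, cond, schedule=(1.0, 0.75, 0.5, 0.25, 0.0)):
    # Each (t, s) pair is a single network evaluation that maps the noised
    # state at time t directly to time s (4 calls for this schedule).
    x = x_T
    for t, s in zip(schedule[:-1], schedule[1:]):
        x = g(x, t, s, cond)
    return x

# Toy stand-in for a trained solution-flow network (illustration only):
# linearly contracts the state toward a fixed "clean" prediction of zeros.
def toy_g(x_t, t, s, c):
    x0_hat = torch.zeros_like(x_t)
    return x_t + (t - s) / max(t, 1e-6) * (x0_hat - x_t)

sample = few_step_sample(toy_g, torch.randn(1, 16, 4, 32, 32), cond=None)
print(sample.shape)  # torch.Size([1, 16, 4, 32, 32])
```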

Paper Structure

This paper contains 33 sections, 3 theorems, 34 equations, 15 figures, 6 tables, and 2 algorithms.

Key Result

Lemma 1 (MVA residual equals semigroup defect)

Let $g(x_t,t,s,c)$ be any mapping that satisfies the boundary condition $g(x_t,t,t,c)=x_t$. Fix $0\le s < \ell < t \le 1$ and define $x_\ell \triangleq g(x_t,t,\ell,c)$. Then the mean-velocity additivity residual in the main paper simplifies exactly to the semigroup defect $\left\| g(x_\ell,\ell,s,c) - g(x_t,t,s,c) \right\|$. Consequently, substituting $x_\ell$ back into the norm yields $\left\| g\big(g(x_t,t,\ell,c),\ell,s,c\big) - g(x_t,t,s,c) \right\|$, the discrepancy between the direct long jump $t \to s$ and the composition of the two short jumps $t \to \ell \to s$.
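As a quick sanity check of Lemma 1, the snippet below verifies the identity numerically for an arbitrary toy map $g$ satisfying the boundary condition. It assumes the mean velocity induced by $g$ is $u^g(x_t,t,s)=(x_t-g(x_t,t,s))/(t-s)$ and writes the MVA residual as $\|(t-s)\,u^g(x_t,t,s)-(t-\ell)\,u^g(x_t,t,\ell)-(\ell-s)\,u^g(x_\ell,\ell,s)\|$; both the definition and the toy map are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))

def g(x, t, s):
    # Toy map: satisfies g(x, t, t) = x but is not an exact semigroup,
    # so both sides of the identity are nonzero.
    return x + (t - s) * np.tanh(W @ x + t + s)

def u(x, t, s):
    # Assumed mean velocity induced by g.
    return (x - g(x, t, s)) / (t - s)

x_t = rng.normal(size=4)
t, ell, s = 0.9, 0.5, 0.1
x_ell = g(x_t, t, ell)

# Mean-velocity additivity residual (left-hand side).
residual = np.linalg.norm(
    (t - s) * u(x_t, t, s) - (t - ell) * u(x_t, t, ell) - (ell - s) * u(x_ell, ell, s)
)
# Semigroup defect (right-hand side).
defect = np.linalg.norm(g(x_ell, ell, s) - g(x_t, t, s))

print(residual, defect)
assert np.isclose(residual, defect)
```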

Figures (15)

  • Figure 1: Text-to-video results generated by our EFlow with 4 inference steps.
  • Figure 2: (a) Feed-forward latency vs. token number for different attention designs. Softmax attention (Wan 2.1) scales poorly as the sequence length grows, while our GLGA (EFlow) is significantly faster than softmax and remains competitive with sparse/linear baselines. Applying 75% token dropping to GLGA yields the lowest latency and the most favorable scaling. (b) 480p inference latency for our model and recent video generators. Our model achieves 1.4$\times$ and 45.3$\times$ faster diffusion inference than the baselines by combining an efficient backbone with few-step sampling.
  • Figure 3: Overview of our VideoDiT architecture. (a) The network processes video tokens through stacked DiT blocks, utilizing long skip connections to stabilize information flow when tokens are aggressively dropped. (b) Within each DiT block, the Gated Local–Global Attention (GLGA) module projects input tokens into shared queries, keys, and values. These features are processed by two parallel mechanisms: a linear attention branch for efficient global context and a sliding-window attention branch for expressive local detail. Finally, an input-aware gate ($g$) adaptively fuses the global ($O^{\text{global}}$) and local ($O^{\text{local}}$) outputs on a per-token basis; a minimal code sketch of this fusion follows the figure list below.
  • Figure 4: End-to-end efficiency gains from our framework. (a) Training efficiency measured as throughput (iterations/sec). Relative to the standard flow-matching baseline, introducing the solution-flow (SoFlow) objective and the MVA regularizer increases per-iteration overhead and therefore reduces throughput. In contrast, our efficient backbone, PDG training, and token dropping provide consistent speedups, improving throughput by up to $1.6\times$ and $2.5\times$ over the baselines. (b) Inference efficiency. Our overall framework substantially reduces end-to-end inference time relative to the softmax+flow-matching baseline ($147\rightarrow10.09$s).
  • Figure 5: Systematic comparison on Kinetics 700. (a) Training loss vs. iterations comparing the softmax-attention baseline, Sana-Video attention, and our GLGA module. (b) FVD vs. training compute across various model scales for softmax, Sana-Video attention, sparse-linear attention (SLA), and GLGA. GLGA consistently achieves lower loss and lower FVD at matched compute, and its advantage persists as model scale increases; SLA becomes unstable and diverges under the same training setup.
  • ...and 10 more figures
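To make the GLGA fusion described in Figure 3(b) concrete, here is a minimal, single-head PyTorch sketch under several simplifying assumptions: the positive feature map for the linear-attention branch (elu + 1), the sigmoid per-token gate, and the dense band mask used for the sliding-window branch are illustrative choices rather than the paper's implementation (a real kernel would avoid the dense N×N mask), and normalization, multi-head structure, and token dropping are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLocalGlobalAttention(nn.Module):
    """Minimal, hypothetical sketch of a gated local-global attention block."""
    def __init__(self, dim: int, window: int = 16):
        super().__init__()
        self.window = window
        self.qkv = nn.Linear(dim, 3 * dim)   # shared Q/K/V for both branches
        self.gate = nn.Linear(dim, 1)        # input-aware per-token gate
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Global branch: linear attention with a positive feature map
        # (elu + 1), giving O(N) cost in the sequence length.
        qg, kg = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", kg, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", qg, kg.sum(1)) + 1e-6)
        out_global = torch.einsum("bnd,bde,bn->bne", qg, kv, z)

        # Local branch: sliding-window softmax attention, written here with a
        # dense band mask for clarity (not how an efficient kernel would do it).
        n = x.shape[1]
        idx = torch.arange(n, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        attn = attn.masked_fill(~band, float("-inf")).softmax(-1)
        out_local = attn @ v

        # Input-aware gating fuses global and local outputs per token.
        g = torch.sigmoid(self.gate(x))
        return self.proj(g * out_global + (1 - g) * out_local)

# Tiny usage example
blk = GatedLocalGlobalAttention(dim=32, window=4)
tokens = torch.randn(2, 64, 32)
print(blk(tokens).shape)  # torch.Size([2, 64, 32])
```

The gate makes the global/local trade-off input-dependent: tokens whose gate saturates near 1 rely mostly on the linear global branch, while gates near 0 fall back to local window attention.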

Theorems & Definitions (5)

  • Lemma 1: MVA residual equals semigroup defect
  • Proof
  • Theorem 1: Long-jump error decomposes into short-jump errors + semigroup defect
  • Proof
  • Corollary 1: Connection to SoFlow-style bounds and quadratic shrinkage