ProReflow: Progressive Reflow with Decomposed Velocity

Lei Ke, Haohang Xu, Xuefei Ning, Yu Li, Jiajun Li, Haoling Li, Yuxuan Lin, Dongsheng Jiang, Yujiu Yang, Linfeng Zhang

TL;DR

ProReflow addresses the computational bottleneck of diffusion model sampling by introducing two complementary techniques: Progressive ReFlow, which employs curriculum-like, windowed reflow from local timesteps to the full trajectory, and Aligned V-Prediction, which prioritizes velocity direction over magnitude via a cosine-based directional loss. Together, these methods enable high-quality, few-step generation on SDv1.5 and SDXL with substantially reduced training cost, achieving state-of-the-art results at 4-step sampling (e.g., FID $=10.70$ on COCO-2014 using a 32-step teacher). The work provides both quantitative gains (lower FID and strong CLIP scores) and qualitative improvements (finer detail, better global structure) while offering theory-backed explanations via curriculum learning and privileged-information-inspired distillation. Practically, ProReflow reduces inference latency and computational demand for diffusion-based synthesis, making few-step or near real-time generation more feasible on large-scale models.

Abstract

Diffusion models have achieved significant progress in both image and video generation, but they still suffer from high computational cost. As an effective solution, flow matching aims to reflow the diffusion process into a straight line, enabling few-step and even one-step generation. In this paper, however, we argue that the original training pipeline of flow matching is not optimal, and we introduce two techniques to improve it. First, we introduce progressive reflow, which progressively reflows the diffusion model over local timesteps until the whole diffusion process is covered, reducing the difficulty of flow matching. Second, we introduce aligned v-prediction, which emphasizes direction matching over magnitude matching in flow matching. Experimental results on SDv1.5 and SDXL demonstrate the effectiveness of our method; for example, applied to SDv1.5, it achieves an FID of 10.70 on the MSCOCO2014 validation set with only 4 sampling steps, close to our teacher model (32 DDIM steps, FID = 10.05).
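To make the idea of aligned v-prediction concrete, here is a minimal sketch of a loss that adds a cosine-based direction term to a plain MSE term. The text above only states that the loss is cosine-based and that direction is prioritized over magnitude; the exact combination used here (MSE plus `alpha` times one minus the cosine similarity), the function name `aligned_v_loss`, and the default `alpha` are illustrative assumptions, not the authors' implementation. Velocities are treated as flat lists of floats to keep the example dependency-free.

```python
import math


def aligned_v_loss(v_pred, v_target, alpha=0.5, eps=1e-8):
    """Hypothetical loss: MSE + alpha * (1 - cosine similarity).

    v_pred, v_target: flat sequences of floats (flattened velocities).
    alpha: weight of the direction term (assumed form; alpha=0 gives MSE only).
    """
    if len(v_pred) != len(v_target):
        raise ValueError("velocity vectors must have the same length")

    n = len(v_pred)
    # Magnitude-sensitive term: mean squared error over all components.
    mse = sum((p - t) ** 2 for p, t in zip(v_pred, v_target)) / n

    # Direction-sensitive term: cosine similarity between the two vectors.
    dot = sum(p * t for p, t in zip(v_pred, v_target))
    norm_p = math.sqrt(sum(p * p for p in v_pred))
    norm_t = math.sqrt(sum(t * t for t in v_target))
    cos = dot / max(norm_p * norm_t, eps)  # eps guards against zero vectors

    return mse + alpha * (1.0 - cos)


if __name__ == "__main__":
    target = [1.0, 0.0]
    # Same direction, wrong magnitude: only the MSE term is non-zero.
    print(aligned_v_loss([2.0, 0.0], target, alpha=1.0))
    # Orthogonal direction: both terms contribute.
    print(aligned_v_loss([0.0, 1.0], target, alpha=1.0))
```

In this sketch, a prediction with the right direction but the wrong length is penalized only by the MSE term, while a misdirected prediction is penalized by both terms, which is one simple way to weight direction more heavily than magnitude.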

Paper Structure

This paper contains 24 sections, 14 equations, 5 figures, 4 tables, 3 algorithms.

Figures (5)

  • Figure 1: (a) L2 distance and cosine similarity between velocities at different timesteps; the velocity discrepancy grows as the timesteps move further apart. (b) The consistently larger FID degradation under directional noise demonstrates that velocity direction is more critical for generation quality.
  • Figure 2: Conceptual illustration of different methods. (a)-(e) compare training objectives and sampling trajectories across methods. Arrows show optimization targets, and red dashed lines represent actual sampling trajectories, which are curved because the optimization does not reach the theoretical optimum. (e) shows that our progressive reflow method achieves a better approximation. (f) shows how our proposed aligned v-prediction works between timesteps $[t, t+1]$: it reduces prediction deviation by correcting the velocity direction.
  • Figure 3: Performance of our models under different classifier-free guidance (CFG) scales on COCO-2017. The CFG scale ranges from 2 to 7. I and II stand for ProReflow-I with 4 steps and ProReflow-II with 2 steps, respectively.
  • Figure 4: FID on COCO-30K. The yellow curve shows results trained with 4 windows and evaluated using 4 inference steps, while the blue curve represents the model trained with 8 windows and evaluated using 8 inference steps. Both configurations are compared against their baselines where $\alpha=0$ (MSE loss only). Each model is trained for 10,000 iterations with batch size 32.
  • Figure 5: Qualitative comparison of image generation results. Our method demonstrates superior detail rendering compared to other flow-based approaches at both 2-step and 4-step sampling.
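
The windowed training referred to in the Figure 4 caption (models trained with 8 or 4 windows) can be pictured with a small sketch of a progressive window schedule. The summary above only says that reflow proceeds from local timesteps to the full trajectory; the halving schedule (8, 4, 2, 1 windows), the equal-width split, the normalized time range [0, 1], and the helper names `make_windows` and `progressive_schedule` are all assumptions made for illustration.

```python
def make_windows(num_windows):
    """Split the normalized time range [0, 1] into equal (start, end) windows."""
    if num_windows < 1:
        raise ValueError("num_windows must be at least 1")
    return [(k / num_windows, (k + 1) / num_windows) for k in range(num_windows)]


def progressive_schedule(start_windows=8):
    """Return one window list per training stage, halving the window count
    each stage until a single window spans the whole trajectory.

    The halving rule is an assumed schedule, not taken from the paper.
    """
    stages = []
    n = start_windows
    while n >= 1:
        stages.append(make_windows(n))
        n //= 2
    return stages


if __name__ == "__main__":
    for stage_index, windows in enumerate(progressive_schedule(8)):
        # Within a stage, reflow would be applied inside each window separately,
        # so the model only has to straighten a short piece of the trajectory.
        print(f"stage {stage_index}: {len(windows)} windows, first = {windows[0]}")
```

Under this reading, a model trained with N windows is naturally sampled with N steps, one per window, which matches the pairing of window count and inference steps described in the Figure 4 caption.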