Improving the Training of Rectified Flows

Sangyun Lee; Zinan Lin; Giulia Fanti

TL;DR

The paper tackles the costly sampling problem in diffusion models by focusing on rectified flows and the Reflow training algorithm. It shows that, in practical settings, a single Reflow iteration suffices to produce nearly straight ODE trajectories when combined with targeted training enhancements, enabling competitive sampling with 1-2 function evaluations (NFEs). By introducing a U-shaped timestep distribution, an LPIPS-Huber premetric, diffusion-model initialization, and real-data incorporation, the authors achieve state-of-the-art or competitive FID on CIFAR-10 and ImageNet 64x64 in this regime, and they also report insights into sampling efficiency and inversion. The work argues for more effective one-round training of rectified flows as a practical alternative to distillation methods in the low-NFE regime, with implications for fast, invertible generative modeling.
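
To make the idea of a U-shaped timestep distribution concrete, here is a minimal sketch in Python. The density used, p(t) ∝ cosh(a (t - 1/2)) on [0, 1], the parameter `a`, and the function name `sample_u_shaped_t` are assumptions chosen for illustration; the summary above does not state the paper's exact density, so this should not be read as the authors' formula.

```python
import numpy as np

def sample_u_shaped_t(n, a=4.0, rng=None):
    """Draw n timesteps in [0, 1] from p(t) proportional to cosh(a * (t - 0.5)).

    Illustrative density only (an assumption): it places more mass near
    t = 0 and t = 1 than near t = 0.5.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, 1.0, size=n)
    # Inverse-CDF sampling. The CDF is
    #   F(t) = (sinh(a * (t - 0.5)) + sinh(a / 2)) / (2 * sinh(a / 2)),
    # which inverts in closed form via arcsinh.
    return 0.5 + np.arcsinh((2.0 * u - 1.0) * np.sinh(a / 2.0)) / a

t = sample_u_shaped_t(100_000, rng=np.random.default_rng(0))
hist, _ = np.histogram(t, bins=10, range=(0.0, 1.0))
print(hist / hist.sum())  # the end bins receive more mass than the middle bins
```

In a training loop, timesteps drawn this way would replace uniform draws, so that the ends of the interval are visited more often than the middle.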

Abstract

Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 75% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\times$64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at https://github.com/sangyun884/rfpp.
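
The link between straight trajectories and low NFE can be illustrated with a toy Euler integrator. The sketch below is not the authors' code: the two velocity fields (`straight`, `curved`) and the endpoints `x0` and `z` are hypothetical, and the time convention (t = 1 is noise, t = 0 is data) is assumed from the perturbation kernel $\mathcal{N}((1-t)\mathbf{x}, t^2\mathbf{I})$ quoted in Proposition 1.

```python
import numpy as np

def euler_sample(velocity, z, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = z.copy()
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One function evaluation (NFE) per Euler step.
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x

# Hypothetical endpoints of a single noise-data pair.
x0 = np.array([1.0, -2.0])   # "data" endpoint
z = np.array([0.5, 0.5])     # "noise" endpoint

def straight(x, t):
    # Straight path x_t = (1 - t) * x0 + t * z has constant velocity z - x0.
    return z - x0

def curved(x, t):
    # Curved path x_t = x0 + t**2 * (z - x0) has time-varying velocity.
    return 2.0 * t * (z - x0)

err_straight = np.linalg.norm(euler_sample(straight, z, 1) - x0)
err_curved = [np.linalg.norm(euler_sample(curved, z, n) - x0) for n in (1, 2, 8)]
print(err_straight, err_curved)
```

For the constant-velocity path a single Euler step lands on the data endpoint, while the curved path leaves a truncation error that shrinks only as more steps are taken. This is the sense in which straighter ODE paths make 1-2 NFE sampling viable.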

Paper Structure (28 sections, 1 theorem, 16 equations, 21 figures, 6 tables, 2 algorithms)

Introduction
Background
Rectified Flow
Reflow
Applying Reflow Once is Sufficient
Improved Training Techniques for Reflow
Timestep distribution
Loss function
Initialization with pre-trained diffusion models
Incorporating real data
Experiments
Unconditional and class-conditional image generation
Reflow can be computationally more efficient than other distillation methods
Effects of samplers
Inversion
...and 13 more sections

Key Result

Proposition 1

Let $p^{\text{RE}}(\mathbf{x} | \mathbf{x}_t, t)$ be the posterior distribution of the perturbation kernel $\mathcal{N}((1-t) \mathbf{x}, t^2\mathbf{I})$. Also, let $p^{\text{VP}}(\mathbf{x} | \mathbf{x}_t, t)$ and $p^{\text{VE}}(\mathbf{x} | \mathbf{x}_t, t)$ be the posterior distributions of [...] where $s_{\text{VP}}$ and $s_{\text{VE}}$ are the scaling factors and $t_{\text{VP}}$ and [...]
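
As a small aid for reading the proposition, the perturbation kernel $\mathcal{N}((1-t)\mathbf{x}, t^2\mathbf{I})$ can be sampled by reparameterization, $\mathbf{x}_t = (1-t)\mathbf{x} + t\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The sketch below checks the mean and standard deviation empirically; the helper name `perturb` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def perturb(x, t, rng):
    """Sample x_t ~ N((1 - t) * x, t**2 * I) via reparameterization."""
    eps = rng.standard_normal(x.shape)
    return (1.0 - t) * x + t * eps

rng = np.random.default_rng(0)
x = np.full(200_000, 2.0)  # toy "data": every coordinate equals 2
t = 0.25
x_t = perturb(x, t, rng)

# Empirical moments should be close to (1 - t) * 2 = 1.5 and t = 0.25.
print(x_t.mean(), x_t.std())
```

At t = 0 the sample equals the data point and at t = 1 it is pure standard Gaussian noise, which matches the noise-to-data direction used when sampling.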

Figures (21)

Figure 1: Rectified flow process (figure modified from Liu et al., 2022). Rectified flow rewires trajectories so there are no intersecting trajectories $(a)\to (b)$. Then, we take noise samples from $p_\mathbf{z}$ and their generated samples from $p^1_\mathbf{x}$, and linearly interpolate them $(c)$. In Reflow, rectified flow is applied again $(c)\to (d)$ to straighten flows. This procedure is repeated recursively.
Figure 2: An illustration of the intuition in the section "Applying Reflow Once is Sufficient". (a) If two linear interpolation trajectories intersect, $\mathbf{z}'' - \mathbf{z}'$ is parallel to $\mathbf{x}' - \mathbf{x}''$. This generally maps $\mathbf{z}''$ to an atypical realization of Gaussian noise (e.g., one with high autocorrelation or a norm too large to lie on the Gaussian annulus), so the 1-rectified flow cannot reliably map $\mathbf{z}''$ to $\mathbf{x}''$ on $\mathcal{M}_{\mathbf{x}}$. (b) Generated samples from the pre-trained 1-rectified flow starting from $\mathbf{z}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$ (right), which is the standard setting, and from $\mathbf{z}''=\mathbf{z} + (\mathbf{x}' - \mathbf{x}'')$, where $\mathbf{x}',\mathbf{x}''$ are sampled from the 1-rectified flow trained on CIFAR-10 (left). Qualitatively, the left samples have very low quality. (c) The empirical $\ell_2$ norm of $\mathbf{z}''=\mathbf{z} + (\mathbf{x}' - \mathbf{x}'')$ compared to that of $\mathbf{z}'$, which is sampled from the standard Gaussian. $\mathbf{z}''$ generally lands outside the annulus of typical Gaussian noise. (d) $\mathbf{z} + (\mathbf{x}' - \mathbf{x}'')$ has high autocorrelation, while the autocorrelation of Gaussian noise is nearly zero in high-dimensional space.
Figure 3: Training loss of the vanilla 2-rectified flow on CIFAR-10 measured on $5,000$ samples after $200,000$ iterations. The shaded area represents one standard deviation of the loss. The dashed curve is our U-shaped timestep distribution, scaled by a constant factor for visualization.
Figure 4: Effects of ODE Solver and new update rule.
Figure 5: Inversion results on CIFAR-10. (a) Reconstruction error between real and reconstructed data is measured by the mean squared error (MSE), where the x-axis represents NFEs used for inversion and reconstruction (e.g. 2 means 2 for inversion and 2 for reconstruction). (b) Distribution of $||\mathbf{z}||_2^2$ of the inverted noises as a proxy for Gaussianity (NFE = 8). The green histogram represents the distribution of true noise, which is Chi-squared with $3 \times 32 \times 32 = 3072$ degrees of freedom. (c) Inversion and reconstruction results using (8 + 8) NFEs. With only 8 NFEs, EDM fails to produce realistic noise, and also the reconstructed samples are blurry.
...and 16 more figures
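
The Gaussian-annulus argument behind Figures 2(c) and 5(b) can be checked numerically. For $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ in $d = 3072$ dimensions, $\|\mathbf{z}\|_2^2$ is Chi-squared with $d$ degrees of freedom, so it concentrates around $d$ with standard deviation $\sqrt{2d}$. The sketch below uses uniform random arrays as stand-ins for $\mathbf{x}'$ and $\mathbf{x}''$; these are an assumption for illustration, not samples from a trained rectified flow, so only the qualitative effect carries over.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3 * 32 * 32  # 3072, the CIFAR-10 image dimension
n = 500

# Squared norms of standard Gaussian noise follow a Chi-squared(d) law.
z = rng.standard_normal((n, d))
sq_norm = (z ** 2).sum(axis=1)
print(sq_norm.mean(), sq_norm.std())  # close to d and sqrt(2 * d)

# Stand-ins for x' and x'' (uniform pixels in [-1, 1]); NOT real model samples.
x1 = rng.uniform(-1.0, 1.0, size=(n, d))
x2 = rng.uniform(-1.0, 1.0, size=(n, d))
z_shift = z + (x1 - x2)
shift_sq_norm = (z_shift ** 2).sum(axis=1)

# Fraction of shifted points more than 3 standard deviations away from d.
frac_outside = np.mean(np.abs(shift_sq_norm - d) > 3.0 * np.sqrt(2.0 * d))
print(frac_outside)
```

Because the shift is independent of the noise in this toy setup, the expected squared norm grows by the squared norm of the shift, which moves the shifted points well outside the band where typical Gaussian noise lives.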

Theorems & Definitions (1)

Proposition 1
