Generative Modeling via Drifting

Mingyang Deng, He Li, Tianhong Li, Yilun Du, Kaiming He

TL;DR

Drifting Models evolve the pushforward distribution $q = (f_\theta)_{\#}\, p_{\boldsymbol{\epsilon}}$ during training rather than at inference, using a drifting field $\mathbf{V}_{p,q}$ that vanishes at equilibrium, when $q = p_{\text{data}}$. The resulting generator needs only a single forward pass (1 NFE). The drift combines kernelized attraction toward data samples with repulsion from generated samples, and training uses a fixed-point objective with stop-gradient targets to align $q$ with $p_{\text{data}}$. The method extends drifting to feature space, supports multi-scale representations, and can incorporate classifier-free guidance by drawing on both class-conditional and unconditional data. Empirically, it achieves state-of-the-art 1-NFE FID on ImageNet 256×256 in both latent space ($\mathrm{FID}=1.54$) and pixel space ($\mathrm{FID}=1.61$), and it also performs well on robotics control, offering a practical, diffusion-free paradigm for high-quality, efficient generation.
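
One way to read "fixed-point objective with stop-gradient targets" is sketched below. This is a hedged reconstruction from the summary, not necessarily the paper's exact equation: each generated sample is regressed toward its own drifted position, and that target is held constant during backpropagation. The symbols $\mathcal{L}$ and $\mathrm{sg}(\cdot)$ (stop-gradient) are notation assumed here, and $p$ stands for $p_{\text{data}}$.

$$
\mathbf{x} = f_\theta(\boldsymbol{\epsilon}), \qquad
\mathcal{L}(\theta) = \mathbb{E}_{\boldsymbol{\epsilon} \sim p_{\boldsymbol{\epsilon}}}
\left\| \mathbf{x} - \mathrm{sg}\big(\mathbf{x} + \mathbf{V}_{p,q}(\mathbf{x})\big) \right\|^2
$$

Under this form the loss value equals $\mathbb{E}\,\|\mathbf{V}_{p,q}(\mathbf{x})\|^2$, so it is zero exactly when the drift vanishes on the generated samples, while its gradient pushes each output $f_\theta(\boldsymbol{\epsilon})$ along $\mathbf{V}_{p,q}$.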

Abstract

Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow-based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one-step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one-step generator achieves state-of-the-art results on ImageNet at 256 x 256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high-quality one-step generation.

Paper Structure (82 sections, 1 theorem, 53 equations, 15 figures, 11 tables, 2 algorithms)


Key Result

Proposition 3.1

Consider an anti-symmetric drifting field: $\mathbf{V}_{p,q}(\mathbf{x}) = -\mathbf{V}_{q,p}(\mathbf{x}),\ \forall \mathbf{x}$. Then we have: $\quad q=p \quad \Rightarrow \quad \mathbf{V}_{p,q}(\mathbf{x}) = \mathbf{0},\ \forall \mathbf{x}$.
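
The implication follows in one line. The sketch below assumes that anti-symmetry means swapping the roles of $p$ and $q$ flips the sign of the field, i.e. $\mathbf{V}_{p,q}(\mathbf{x}) = -\mathbf{V}_{q,p}(\mathbf{x})$ for all $\mathbf{x}$:

$$
q = p
\;\;\Longrightarrow\;\;
\mathbf{V}_{p,p}(\mathbf{x}) = -\mathbf{V}_{p,p}(\mathbf{x})
\;\;\Longrightarrow\;\;
2\,\mathbf{V}_{p,p}(\mathbf{x}) = \mathbf{0}
\;\;\Longrightarrow\;\;
\mathbf{V}_{p,q}(\mathbf{x}) = \mathbf{0}, \quad \forall \mathbf{x}.
$$

Note that the proposition as stated gives only this direction: matching distributions imply zero drift. It does not by itself claim that zero drift implies $q = p$.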

Figures (15)

  • Figure 1: Drifting Model. A network $f$ performs a pushforward operation: $q={f}_\# p_{\text{prior}}$, mapping a prior distribution $p_{\text{prior}}$ (e.g., Gaussian, not shown here) to a pushforward distribution $q$ (orange). The goal of training is to approximate the data distribution $p_{\text{data}}$ (blue). As training iterates, we obtain a sequence of models $\{f_i\}$, which corresponds to a sequence of pushforward distributions $\{q_i\}$. Our Drifting Model focuses on the evolution of this pushforward distribution at training-time. We introduce a drifting field (detailed in main text) that approaches zero when $q$ matches $p_{\text{data}}$. This drifting field provides a loss function (y-axis, in log-scale) for training.
  • Figure 2: Illustration of drifting a sample. A generated sample $\mathbf{x}$ (black) drifts according to a vector: $\mathbf{V}=\mathbf{V}^+_{p}-\mathbf{V}^-_{q}$. Here, $\mathbf{V}^+_{p}$ is the mean-shift vector of the positive samples (blue) and $\mathbf{V}^-_{q}$ is the mean-shift vector of the negative samples (orange); see the mean-shift equation in the main text. $\mathbf{x}$ is attracted by $\mathbf{V}^+_{p}$ and repulsed by $\mathbf{V}^-_{q}$.
  • Figure 3: Evolution of the generated distribution. The distribution $q$ (orange) evolves toward a bimodal target $p$ (blue) during training. We show three initializations of $q$: (top): initialized between the two modes; (middle): initialized far from both modes; (bottom): initialized collapsed onto one mode. Across all initializations, our method approximates the target distribution without mode collapse.
  • Figure 4: Evolution of samples. We show generated points sampled at different training iterations, along with their loss values. The loss (whose value equals $\|V\|^2$) decreases as the distribution converges to the target. (y-axis is log-scale.)
  • Figure 5: Effect of CFG scale $\alpha$. (a): FID vs. $\alpha$. (b): IS vs. $\alpha$. (c): IS vs. FID. We show the L/2 (solid) and B/2 (dashed) models. Consistent with common observations in diffusion-/flow-based models, the CFG scale effectively trades off distributional coverage (as reflected by FID) against per-image quality (measured by IS). Notably, with the L/2 model, the optimal FID is achieved at $\alpha{=}1.0$, which is often regarded as "w/o CFG" in diffusion-/flow-based models. For B/2, the optimal FID is achieved at $\alpha{=}1.1$.
  • ...and 10 more figures
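
The captions of Figures 2 to 4 describe a drift $\mathbf{V}=\mathbf{V}^+_{p}-\mathbf{V}^-_{q}$ built from two mean-shift vectors, and a loss whose value equals $\|V\|^2$. A minimal NumPy sketch of that idea on a 1-D bimodal toy target follows. Several things here are assumptions of the sketch and not taken from the paper: the Gaussian kernel and its `bandwidth`, the per-set weight normalization, the function names `mean_shift`, `drift`, and `evolve`, and the use of free particles in place of a neural network $f$.

```python
import numpy as np

def mean_shift(x, samples, bandwidth=1.0):
    """Kernel-weighted average of (samples - x) for every row of x.

    x: (n, d) query points, samples: (m, d) reference points.
    The Gaussian kernel is an assumption of this sketch.
    """
    diff = samples[None, :, :] - x[:, None, :]            # (n, m, d) offsets
    logits = -np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("nm,nmd->nd", weights, diff)

def drift(x, positives, negatives, bandwidth=1.0):
    """V = V_p^+ - V_q^-: pulled toward data, pushed away from generated samples."""
    return mean_shift(x, positives, bandwidth) - mean_shift(x, negatives, bandwidth)

def evolve(x, data, steps=300, step_size=0.2, bandwidth=1.0):
    """Move particles along the drift (hypothetical stand-in for training f)."""
    x = x.copy()
    losses = []
    for _ in range(steps):
        v = drift(x, data, x, bandwidth)   # negatives = current generated samples
        target = x + v                     # held constant, i.e. a stop-gradient target
        losses.append(float(np.mean(np.sum((x - target) ** 2, axis=-1))))
        # Gradient of 0.5 * ||x - target||^2 w.r.t. x is (x - target) = -v,
        # so one gradient-descent step is a move along v.
        x = x + step_size * v
    return x, losses

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.3, (128, 1)),
                       rng.normal(2.0, 0.3, (128, 1))])
x0 = rng.normal(0.0, 0.2, (128, 1))       # initialized between the two modes
x_final, losses = evolve(x0, data)
print(f"loss at first step: {losses[0]:.4f}, at last step: {losses[-1]:.6f}")
```

Because both mean-shift terms use the same functional, passing the same sample set as positives and negatives gives a drift of exactly zero, which mirrors the equilibrium property in Proposition 3.1. In an actual model the particle update would be replaced by an optimizer step on the network parameters.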

Theorems & Definitions (1)

  • Proposition 3.1