Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective

Erkan Turan, Maks Ovsjanikov

TL;DR

Under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. The paper also proves that the stop-gradient operator follows directly from the frozen-field discretization mandated by the JKO scheme, so removing it severs training from any gradient-flow guarantee.

Abstract

Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to Landau damping in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
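The annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ from the abstract can be sketched in a few lines. The function names and the numeric values below are illustrative assumptions, not taken from the paper. The helper `time_to_reach` simply inverts the schedule; if one assumes (for illustration only) that the bandwidth must shrink to roughly $1/K_{\max}$ to resolve the highest frequency, the time needed grows like $\log K_{\max}$, which is at least consistent with the $O(\log K_{\max})$ scaling quoted above.

```python
import math


def annealed_bandwidth(t, sigma0, r):
    """Exponential annealing schedule sigma(t) = sigma0 * exp(-r * t)."""
    return sigma0 * math.exp(-r * t)


def time_to_reach(target_sigma, sigma0, r):
    """Invert the schedule: the time at which sigma(t) equals target_sigma."""
    if r <= 0 or not (0 < target_sigma <= sigma0):
        raise ValueError("need r > 0 and 0 < target_sigma <= sigma0")
    return math.log(sigma0 / target_sigma) / r


if __name__ == "__main__":
    # Hypothetical values: sigma0 = 1.0, r = 0.5, target bandwidth 1 / k_max.
    for k_max in (4, 16, 64, 256):
        print(k_max, time_to_reach(1.0 / k_max, sigma0=1.0, r=0.5))
```

Each fourfold increase in `k_max` adds the same constant to the printed time, which is the logarithmic growth in question.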

Paper Structure (86 sections, 12 theorems, 166 equations, 5 figures)

Key Result

Theorem 4.1

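Under the Gaussian kernel $\varphi_\sigma$, the drift operator admits the closed-form expression $V_{p,q}(x)=\sigma^2\nabla\log\frac{p_\sigma(x)}{q_\sigma(x)}=\sigma^2\bigl(\nabla\log p_\sigma(x)-\nabla\log q_\sigma(x)\bigr)$, where $p_\sigma:=p*\varphi_\sigma$ and $q_\sigma:=q*\varphi_\sigma$.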
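A small numerical check of this identity, in the spirit of Figure 1 but in one dimension, might look as follows. It is a sketch under stated assumptions: the drift is taken to be the normalized Gaussian-kernel mean shift toward samples of $p$ minus the mean shift toward samples of $q$ (the page does not reproduce the operator's definition), $p$ and $q$ are single Gaussians so that $p_\sigma$ and $q_\sigma$ have closed-form scores, and every function name and parameter value is hypothetical.

```python
import math
import random


def mean_shift(x, samples, sigma):
    """Normalized Gaussian-kernel mean shift at x: sum_i w_i (y_i - x) / sum_i w_i."""
    num = 0.0
    den = 0.0
    for y in samples:
        w = math.exp(-0.5 * ((x - y) / sigma) ** 2)
        num += w * (y - x)
        den += w
    return num / den


def empirical_drift(x, p_samples, q_samples, sigma):
    """Assumed drift: attraction toward p samples minus attraction toward q samples."""
    return mean_shift(x, p_samples, sigma) - mean_shift(x, q_samples, sigma)


def smoothed_gaussian_score(x, mean, std, sigma):
    """Score of N(mean, std^2) convolved with phi_sigma, i.e. of N(mean, std^2 + sigma^2)."""
    return -(x - mean) / (std ** 2 + sigma ** 2)


def analytic_drift(x, p_params, q_params, sigma):
    """Score-difference form sigma^2 * (grad log p_sigma - grad log q_sigma)."""
    score_p = smoothed_gaussian_score(x, p_params[0], p_params[1], sigma)
    score_q = smoothed_gaussian_score(x, q_params[0], q_params[1], sigma)
    return sigma ** 2 * (score_p - score_q)


def max_drift_error(n_samples=20000, sigma=0.5, seed=0):
    """Largest gap between the empirical and analytic drift on a small grid."""
    rng = random.Random(seed)
    p_params, q_params = (1.0, 1.0), (-1.0, 1.0)  # (mean, std), illustrative
    p_samples = [rng.gauss(*p_params) for _ in range(n_samples)]
    q_samples = [rng.gauss(*q_params) for _ in range(n_samples)]
    grid = [-1.5 + 0.5 * i for i in range(7)]
    return max(
        abs(empirical_drift(x, p_samples, q_samples, sigma)
            - analytic_drift(x, p_params, q_params, sigma))
        for x in grid
    )


if __name__ == "__main__":
    print("max |empirical - analytic| on grid:", max_drift_error())
```

The agreement rests on the classical mean-shift identity $\nabla p_\sigma(x)=\sigma^{-2}\int p(y)\,\varphi_\sigma(x-y)\,(y-x)\,dy$, so the residual is pure sampling noise and shrinks as `n_samples` grows.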

Figures (5)

  • Figure 1: Numerical confirmation of Theorem 4.1 on a 4-mode Gaussian mixture. (a) Empirical kernel mean-shift drift ($N=50$k samples). (b) Analytical score-difference form $\sigma^2\nabla\log(p_\sigma/q_\sigma)$. (c) Overlay: the two fields are visually indistinguishable. (d) Pointwise $\ell_2$ error heatmap (mean $4.9\times10^{-3}$). Details are provided in the corresponding appendix section.
  • Figure 2: Spectral validation. (a) Convergence time $T(k)$ vs. frequency: lines are analytical predictions from the general-kernel timescale theorem, markers are measured decay times. The fixed-bandwidth Gaussian exhibits exponential slowdown (Landau damping); the Laplacian kernel yields polynomial scaling; the annealed Gaussian eliminates the bottleneck entirely. (b) Annealing schedules $\sigma(t)$. (c) Total spectral error under different schedules. Details are provided in the corresponding appendix section.
  • Figure 3: Loss landscapes with and without stop-gradient (SG), projected onto the top two principal gradient-variation directions. (a,b) Training loss $\|V\|^2$: without SG the minimum is ${\sim}100\times$ deeper. (c,d) Sliced Wasserstein distance: with SG the loss minimum coincides with low distributional error; without SG the deep minimum corresponds to poor sample quality. Red star: trained solution. Details are provided in the corresponding appendix section.
  • Figure 4: Drift norm vs. distributional distance during training on synthetic 2D targets. (a) Mean drift norm $\|V\|$. (b) Sliced Wasserstein distance. (c) Log-log scatter across training steps and seeds. With stop-gradient (solid), the two quantities are strongly correlated ($r>0.95$) and jointly decay. Without stop-gradient (dashed), the drift norm collapses to ${\sim}10^{-8}$ while the Wasserstein distance remains at $0.389$, a direct demonstration of drift collapse (the necessity part of the stop-gradient theorem). Details are provided in the corresponding appendix section.
  • Figure 5: Sinkhorn-derived drift (bottom row) vs. Laplacian-kernel drift (top row) on the checkerboard distribution. Training snapshots with drift vectors overlaid; rightmost panels show sliced Wasserstein distance over training. Both converge successfully (final SW $1.42\times10^{-2}$ and $2.07\times10^{-2}$ respectively), demonstrating that the gradient-flow template from the section on principled drift construction yields practical operators beyond the original kernel family. Details are provided in the corresponding appendix section.
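The frozen-field idea behind the stop-gradient results in Figures 3 and 4 can be illustrated with a particle-level analogue. This is only a sketch, not the paper's neural training loop: particles stand in for generated samples, the drift is assumed to be a Gaussian-kernel mean-shift difference, and all names and parameter values are hypothetical. The key point is that the field is evaluated once from the current particles, held fixed, and only then applied, which mirrors treating the drift target as a constant during an update.

```python
import math
import random


def mean_shift(x, samples, sigma):
    """Normalized Gaussian-kernel mean shift at x toward the given samples."""
    num = 0.0
    den = 0.0
    for y in samples:
        w = math.exp(-0.5 * ((x - y) / sigma) ** 2)
        num += w * (y - x)
        den += w
    return num / den


def drift(x, targets, particles, sigma):
    """Assumed drift: attraction toward targets minus attraction toward particles."""
    return mean_shift(x, targets, sigma) - mean_shift(x, particles, sigma)


def frozen_field_step(particles, targets, sigma, step):
    """Evaluate the field from the current particles, freeze it, then move."""
    field = [drift(x, targets, particles, sigma) for x in particles]
    moved = [x + step * v for x, v in zip(particles, field)]
    return moved, field


def run(n=100, steps=40, sigma=1.0, step=0.5, seed=0):
    """Return a list of (mean |drift|, |mean gap|) pairs, one per step."""
    rng = random.Random(seed)
    targets = [rng.gauss(2.0, 0.5) for _ in range(n)]    # stand-in for data p
    particles = [rng.gauss(0.0, 0.5) for _ in range(n)]  # stand-in for model q
    target_mean = sum(targets) / n
    history = []
    for _ in range(steps):
        gap = abs(sum(particles) / n - target_mean)
        particles, field = frozen_field_step(particles, targets, sigma, step)
        drift_norm = sum(abs(v) for v in field) / n
        history.append((drift_norm, gap))
    return history


if __name__ == "__main__":
    hist = run()
    print("first step (drift norm, mean gap):", hist[0])
    print("last step  (drift norm, mean gap):", hist[-1])
```

In this toy setting the drift norm and the gap between particle and target means shrink together, the qualitative behaviour Figure 4 reports for training with stop-gradient. The sketch does not attempt to reproduce the collapse observed without it.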

Theorems & Definitions (38)

  • Theorem 4.1: Gaussian drift as score difference
  • proof
  • Remark 4.2
  • Proposition 5.1: Identifiability
  • proof
  • Theorem 5.2: Mode-resolved convergence timescales
  • proof
  • Remark 5.3: Landau damping analogy
  • Corollary 5.4: Gaussian vs. Laplacian convergence times
  • proof
  • ...and 28 more