Mean-Shift Distillation for Diffusion Mode Seeking

Vikas Thamizharasan, Nikitas Chatzis, Iliyan Georgiev, Matthew Fisher, Evangelos Kalogerakis, Difan Liu, Nanxuan Zhao, Michal Lukac

TL;DR

Mean-shift distillation (MSD) reframes diffusion distillation as mode-seeking gradient ascent on the data distribution $p$, deriving a gradient proxy whose extrema align with the modes of $p$. It estimates this gradient with product-density sampling and a simple mean-shift update, requires no retraining, and serves as a drop-in replacement for SDS. MSD reduces gradient variance and improves mode alignment and convergence in both synthetic and practical settings, yielding higher-fidelity text-to-image and text-to-3D results with Stable Diffusion. Practical heuristics stabilize integration in high-dimensional models, and combining MSD with CFG further sharpens mode-focused optimization, making MSD a practical, theoretically grounded alternative to SDS.

Abstract

We present mean-shift distillation, a novel diffusion distillation technique that provides a provably good proxy for the gradient of the diffusion output distribution. This is derived directly from mean-shift mode seeking on the distribution, and we show that its extrema are aligned with the modes. We further derive an efficient product distribution sampling procedure to evaluate the gradient. Our method is formulated as a drop-in replacement for score distillation sampling (SDS), requiring neither model retraining nor extensive modification of the sampling procedure. We show that it exhibits superior mode alignment as well as improved convergence in both synthetic and practical setups, yielding higher-fidelity results when applied to both text-to-image and text-to-3D applications with Stable Diffusion.
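
As a rough illustration of the mode-seeking idea the abstract builds on, the sketch below runs classical mean-shift with a Gaussian kernel on samples from a toy two-component 2D mixture. It is not the paper's MSD procedure, which works with a diffusion model and product-distribution sampling; the function names, the bandwidth value, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def mean_shift_step(x, samples, bandwidth):
    """One classical mean-shift update: kernel-weighted mean of the samples."""
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)  # Gaussian kernel
    return (weights[:, None] * samples).sum(axis=0) / weights.sum()

def seek_mode(x0, samples, bandwidth=0.5, steps=100, tol=1e-6):
    """Iterate mean-shift updates until the point stops moving."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x_new = mean_shift_step(x, samples, bandwidth)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy data (hypothetical): two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
samples = np.concatenate(
    [c + 0.3 * rng.standard_normal((500, 2)) for c in centers]
)

# Starting nearer the right-hand blob, the iterate climbs to that mode.
mode = seek_mode([1.0, 0.5], samples)
print(mode)
```

Each update moves the point to the kernel-weighted mean of nearby samples, which is an ascent step on a kernel density estimate; the bandwidth controls how local that estimate is.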

Paper Structure

This paper contains 32 sections, 18 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Mode-seeking simulated in a fractal-like 2D distribution with two classes (orange, gray), adapted from [karras2024guiding]. We compare the behavior of diffusion sampling (DDIM) to optimization-based diffusion distillation in a class-conditional setting. With class=orange: (a) ground-truth distribution, (b) DDIM sampling, (c) SDS, and (d) our MSD. All methods are run without guidance.
  • Figure 2: We juxtapose diffusion sampling vs. diffusion distillation in low-dimensional (${\mathbb{R}}^2$) and high-dimensional (${\mathbb{R}}^{64 \times 64 \times 4}$) settings, using guidance via CFG [ho2021classifierfree]. Top: (a) text-conditioned generation of an image via DDIM with 32 steps, (b)-(e) optimized coordinate-based neural implicit images for SDS, VSD, SDI, and our MSD, respectively, with Stable Diffusion (CFG=7.5; see the Stable Diffusion experiments section). Bottom: (a) class-conditioned generation of 2D points via DDIM with 32 steps, (b)-(e) optimized 2D points for SDS, VSD, SDI, and our MSD, respectively (CFG=4; see the toy-distribution subsection). Text prompts in clockwise order: "A DSLR photo of a ...hamburger, squirrel dressed as a samurai weighing a katana, knight in silver armor, and bluejay on basket of macarons".
  • Figure 3: Unconditional distillation on two toy density datasets, Pinwheel (top) and Spiral (bottom), given an ideal denoiser ($D^*$) and a learned denoiser ($D_\theta$). For each method and denoiser, we show the optimized samples (left) and the loss landscape (right). Zoom in for clarity.
  • Figure 4: FID vs optimization iterations for text-to-2D generation.
  • Figure 5: Impact of bandwidth ($\lambda$) on the denoised latent ($z_0$). We set $\lambda_3 = 10^3$, $\lambda_2 = 10$, $\lambda_1 = 10^{-2}$. Highlighted images show the optimal bandwidth value corresponding to the $k^{th}$ optimization.
  • ...and 4 more figures