Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering

Sixian Wang, Zhiwei Tang, Tsung-Hui Chang

TL;DR

This work tackles the problem of inconsistent image quality in diffusion models caused by stochastic sampling. It uncovers a strong link between sample quality and the Accumulated Score Differences (ASD), the cumulative divergence between conditional and unconditional scores during classifier-free guidance, and proposes CFG-Rejection, a plug-and-play, reward-free method that prunes low-potential denoising trajectories early by comparing a partial ASD measure $\mathcal{E}_{\tau:T}(c)$ against a threshold $\gamma$. The approach requires no architectural changes or retraining and integrates with existing diffusion pipelines, yielding consistent improvements in human and automated quality metrics across ImageNet, GenEval, DPG-Bench, and visual-text tasks. Through extensive experiments, the paper demonstrates substantial compute savings and quality gains, suggesting that ASD-based latent-space filtering may apply beyond images and offering a practical enhancement for diffusion-based generation that needs neither retraining nor a reward model.

Abstract

Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g., DDPO [1]) and inference-time alignment techniques [2] aim to improve sample fidelity, they typically require the full denoising process and external reward signals. These requirements incur substantial computational costs and hinder broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD), the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process without requiring external reward signals or model retraining. Our approach requires no modifications to model architectures or sampling schedules and remains fully compatible with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
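To make the filtering idea concrete, the following is a minimal sketch of early rejection driven by a partial ASD. It is not the authors' implementation. The 1D two-mode Gaussian "model", the absolute-value norm, the noise schedule, the update rule, and every name (`cond_score`, `uncond_score`, `sample_with_rejection`, `filter_batch`, `check_steps`) are assumptions chosen for illustration. The toy only shows the bookkeeping (accumulate the conditional/unconditional score gap over the first few steps, then compare against $\gamma$); it is not expected to reproduce the quality correlation reported in the paper.

```python
import math
import random

MEANS = (-2.0, 2.0)  # toy class means (assumption); class 1 is the condition


def cond_score(x, var, label=1):
    """Score of a single Gaussian centred on the conditioned class mean."""
    return (MEANS[label] - x) / var


def uncond_score(x, var):
    """Score of an equal-weight mixture over both class means."""
    w = [math.exp(-((x - m) ** 2) / (2 * var)) for m in MEANS]
    return sum(wi * (m - x) / var for wi, m in zip(w, MEANS)) / sum(w)


def sample_with_rejection(rng, gamma, num_steps=50, check_steps=10, guidance=2.0):
    """Run a toy CFG sampler; stop early if the partial ASD is below gamma.

    Returns (sample_or_None, partial_asd, steps_used).
    """
    x = rng.gauss(0.0, 3.0)
    asd = 0.0
    for step in range(num_steps):
        sigma = 3.0 * (1.0 - step / num_steps)  # decreasing toy noise level
        var = 1.0 + sigma ** 2
        s_c, s_u = cond_score(x, var), uncond_score(x, var)
        if step < check_steps:
            asd += abs(s_c - s_u)  # accumulate the score difference
            if step == check_steps - 1 and asd < gamma:
                return None, asd, step + 1  # prune this trajectory early
        guided = s_u + guidance * (s_c - s_u)  # classifier-free guidance
        x += 0.1 * var * guided + 0.1 * sigma * rng.gauss(0.0, 1.0)
    return x, asd, num_steps


def filter_batch(num_samples, gamma, seed=0):
    """Sample a batch, keeping survivors and counting denoising steps spent."""
    rng = random.Random(seed)
    kept, total_steps = [], 0
    for _ in range(num_samples):
        sample, asd, steps = sample_with_rejection(rng, gamma)
        total_steps += steps
        if sample is not None:
            kept.append((sample, asd))
    return kept, total_steps
```

With `gamma=0.0` nothing is pruned and every trajectory runs all 50 steps; raising `gamma` prunes more trajectories after only `check_steps` steps, which is where the compute saving would come from. In a real pipeline the two score calls are already made at every CFG step, so the accumulation itself adds essentially no model evaluations.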

Paper Structure

This paper contains 30 sections, 8 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Illustration of the filtering framework. Best-of-N completes all denoising steps and uses an external reward model to select the high-quality image, while our method halts low-quality generations early using the intrinsic information in the sampling path.
  • Figure 2: Qualitative comparison of filtering results, demonstrating the effectiveness of our method for text alignment on complex prompts. Prompt: "A night sky with constellations forming the words 'Among the stars, we find our dreams and destiny'". Low-ASD images (top row) exhibit completely missing strokes, while high-ASD samples (bottom row) satisfy the textual requirements.
  • Figure 3: A fractal-like 2D distribution with two classes (gray and orange). (a) Samples generated with CFG ($\omega=2$) are color-coded by ASD. High-ASD samples concentrate in the dense trunk. (b) A log-linear trend emerges between local density and ASD, indicating that large score differences align with high-likelihood regions.
  • Figure 4: Density estimation curves for samples with varying accumulated score differences. The top and bottom rows display the AvgkNN and LOF density profiles, respectively, across three distinct labels. As the ASD decreases from rank 0 (highest) to rank 3 (lowest), we observe a systematic shift of samples from high-density to low-density regions.
  • Figure 5: Qualitative comparison on the ImageNet dataset. (Top) Baseline samples with the lowest $\mathcal{E}_{T}(c)$ exhibit artifacts and misalignment. (Bottom) Our selected samples with the highest $\mathcal{E}_{T}(c)$ show better fidelity and prompt adherence.
  • ...and 18 more figures
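
The density analysis described in the Figure 4 caption can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it assumes AvgkNN means the average Euclidean distance to the $k$ nearest neighbours (smaller distance indicating higher density) and that ASD ranks are equal-size bins sorted from highest to lowest ASD. The function names and the binning rule are hypothetical.

```python
import math


def avg_knn_distance(points, k):
    """Average Euclidean distance from each point to its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result


def mean_by_asd_rank(values, asd, num_ranks=4):
    """Mean of `values` per ASD bin; rank 0 holds the highest-ASD samples."""
    order = sorted(range(len(asd)), key=lambda i: asd[i], reverse=True)
    size = math.ceil(len(order) / num_ranks)
    bins = [order[r * size:(r + 1) * size] for r in range(num_ranks)]
    return [sum(values[i] for i in b) / len(b) for b in bins if b]


# Tiny usage example with made-up points and ASD values.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
asd = [4.0, 3.0, 2.0, 1.0]
knn = avg_knn_distance(points, k=1)
per_rank = mean_by_asd_rank(knn, asd, num_ranks=2)
```

If the trend in the caption holds for a given set of samples, the per-rank mean distance would increase from rank 0 to the last rank, meaning lower-ASD samples sit in sparser regions.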