Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation

Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou

TL;DR

This work addresses learning high-quality generative models when clean data are scarce by introducing denoising score distillation (DSD), which first pretrains a diffusion model on noisy data and then distills it into a one-step generator. The authors show that distillation can improve sample quality even when the teacher is degraded, and they provide a linear-theory justification that the distilled model aligns with the eigenspace of the clean data covariance, effectively regularizing the generator. Empirically, DSD yields strong performance with faster, one-step generation across multiple datasets and noise levels, and the paper introduces practical tools such as Proximal FID for model selection in corrupted-data regimes. Together, the theoretical and experimental results suggest that noisy data, when processed via score distillation, can be more informative than previously believed, enabling robust sampling in scientific domains with limited clean data.

Abstract

Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performance; we summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distribution's covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.
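
The eigenspace claim in the linear setting can be motivated by a standard fact: if $y = x + \sigma\varepsilon$ with isotropic Gaussian $\varepsilon$ independent of $x$, then $\mathrm{Cov}(y) = \mathrm{Cov}(x) + \sigma^2 I$, so the noisy and clean covariances share eigenvectors. The NumPy sketch below only illustrates this fact on synthetic low-rank data; it is not the DSD algorithm, and the dimensions, scales, and helper names (`top_eigenspace`, `subspace_distance`) are assumptions made for illustration.

```python
import numpy as np

def top_eigenspace(cov, r):
    """Return the r eigenvectors of a symmetric matrix with the largest eigenvalues."""
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, -r:]

def subspace_distance(A, B):
    """Spectral-norm distance between the orthogonal projectors onto span(A) and span(B)."""
    return np.linalg.norm(A @ A.T - B @ B.T, ord=2)

rng = np.random.default_rng(0)
d, r, n, sigma = 20, 3, 50_000, 0.2  # hypothetical sizes and noise level

# Hypothetical clean data: a rank-r Gaussian supported on span(E).
E, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal basis, shape (d, r)
scales = np.array([3.0, 2.0, 1.0])                # per-direction standard deviations
x = (rng.standard_normal((n, r)) * scales) @ E.T

# Only the corrupted samples y are assumed to be observed.
y = x + sigma * rng.standard_normal((n, d))

noisy_cov = np.cov(y, rowvar=False)
U_noisy = top_eigenspace(noisy_cov, r)

# The top-r eigenspace of the noisy covariance should nearly match the clean subspace.
dist = subspace_distance(U_noisy, E)

# The remaining d - r eigenvalues should sit near the noise floor sigma**2.
noise_floor = np.linalg.eigvalsh(noisy_cov)[: d - r].mean()

print(f"subspace distance: {dist:.4f}")
print(f"mean trailing eigenvalue: {noise_floor:.4f} (sigma^2 = {sigma**2:.4f})")
```

In this toy setting the clean subspace is recoverable from corrupted samples alone, which is consistent with the intuition that a rank-constrained generator can discard the isotropic noise directions.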

Paper Structure

This paper contains 35 sections, 6 theorems, 47 equations, 16 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1

Fix $\sigma > 0$. Under Assumptions assump:linear, aasump:perfect_score, and assump:low_rank_generator, consider the family of parameters $\theta = (U, V)$ such that For any bounded noise schedule $(\sigma_t) \subseteq[\sigma_{\min},\sigma_{\max}]$, the global minimizers of $\mathcal{L}$ (Eq. eq:ideal-score-loss-r) over $\Theta$, denoted by $\theta^*_{\sigma} := (U^*,V^*_{\sigma})$, satisfy the f

Figures (16)

  • Figure 2: Qualitative results of DSD (ours, one-step) at $\sigma=0.2$. While only corrupted images are available during training, DSD is capable of producing refined, clean samples. The left two panels are from CIFAR-10, while the right two are from CelebA-HQ. Zoom in for better viewing.
  • Figure 3: Ablation on diffusion objectives and generator losses. The adjusted diffusion objective leads to excellent performance, while the Fisher divergence with SiD-based gradient estimation helps stabilize the distillation process.
  • Figure 4: A toy example of learning from a noisy dataset with $\sigma=0.05$. Teacher diffusion models such as Ambient-Full and Ambient Truncated tend to force the approximating distribution to spread out its probability mass to cover all regions. DSD excels at denoising the original dataset, demonstrating the implicit regularization effects brought by distillation.
  • Figure 5: Evolution of FIDs and Proximal FIDs on D-SiD. Proximal FID aligns well with FID throughout the distillation process.
  • Figure 6: A toy example illustrating the impact of tuning $\sigma$. We use a noisy training dataset with $\sigma=0.05$. During pretraining and distillation, we experiment with different values of $\hat{\sigma}$, representing underestimation, accurate estimation, and overestimation. A slight overestimation of the noise level tends to increase regularization strength, helping the generated data better adhere to the data manifold.
  • ...and 11 more figures

Theorems & Definitions (9)

  • Theorem 1
  • proof : Proof sketch of Theorem \ref{thm:time-dependent-theorem}
  • Lemma 1
  • Lemma 2
  • Lemma 3: mirsky1975trace
  • Lemma 4
  • proof : Proof of Lemma \ref{lem:pca-like-result}
  • Lemma 5
  • proof : Proof of Lemma \ref{lem:eigenvalue-time-dependent-function}