How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis

TL;DR

The paper shows that models trained only on noisy data cannot match the performance of models trained on clean data at the sample sizes studied ($30,000$ to $\approx 1.3$M samples). A theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample.

Abstract

The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30,000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g. $10\%$ of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.

Paper Structure

This paper contains 36 sections, 26 theorems, 124 equations, 4 figures, 3 tables, 8 algorithms.

Key Result

Lemma 2.1

Let ${\bm{X}}_{t_n} = {\bm{X}}_0 + \sigma_{t_n}{\bm{Z}}_1$ and ${\bm{X}}_t = {\bm{X}}_0 + \sigma_t {\bm{Z}}_2, \quad {\bm{Z}}_1, {\bm{Z}}_2\sim \mathcal{N}(\bm{0}, I)$ i.i.d. Then, for any $\sigma_t > \sigma_{t_n}$, we have that:
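The lemma's conclusion is not reproduced in this listing. As a minimal sketch of the setup only (not of the lemma's result), the two noisy views ${\bm{X}}_{t_n}$ and ${\bm{X}}_t$ defined above can be simulated with NumPy. The toy clean-data distribution, the noise levels $0.5$ and $1.0$, and the helper name `noisy_views` are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def noisy_views(x0, sigma_tn, sigma_t, rng):
    """Return (x_tn, x_t): two views of x0 with independent Gaussian noise.

    Hypothetical helper mirroring the lemma's setup:
    x_tn = x0 + sigma_tn * z1 and x_t = x0 + sigma_t * z2, with z1, z2 ~ N(0, I).
    """
    if not sigma_t > sigma_tn:
        raise ValueError("the lemma assumes sigma_t > sigma_tn")
    z1 = rng.standard_normal(x0.shape)  # noise for the lower level
    z2 = rng.standard_normal(x0.shape)  # independent noise for the higher level
    return x0 + sigma_tn * z1, x0 + sigma_t * z2


rng = np.random.default_rng(0)
# Toy stand-in for clean data (an assumption, not one of the paper's datasets).
x0 = rng.uniform(-1.0, 1.0, size=(100_000, 2))
x_tn, x_t = noisy_views(x0, sigma_tn=0.5, sigma_t=1.0, rng=rng)

# The empirical variance of each residual should be close to sigma squared.
var_tn = float(np.var(x_tn - x0))
var_t = float(np.var(x_t - x0))
print(f"residual variance at sigma_tn: {var_tn:.3f}")
print(f"residual variance at sigma_t:  {var_t:.3f}")
```

The residual variances serve only as a sanity check that each view carries noise of the intended scale.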

Figures (4)

  • Figure 1: Evaluation of training and sampling improvements for models trained with noisy data.
  • Figure 2: Dataset images with varying noise levels ($\sigma$).
  • Figure 3: FID as a function of training steps for a model trained with a mix of clean and noisy data. FID continues to go down as we train more and more, indicating that the model at 200K iterations is still undertrained.
  • Figure : Denoised Method of Moments with Heterogenous Variances

Theorems & Definitions (44)

  • Lemma 2.1: daras2024consistent
  • Definition 4.1
  • Theorem 4.2: Upper bound
  • Corollary 4.3
  • Theorem 4.4: Lower bound
  • Lemma 4.1
  • Lemma 4.5: wygaussian
  • Theorem 4.6
  • proof
  • Proposition B.1
  • ...and 34 more