Spectrally-Guided Diffusion Noise Schedules

Carlos Esteves, Ameesh Makadia

Abstract

Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image's spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design "tight" noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve the generative quality of single-stage pixel diffusion models, particularly in the low-step regime.

Paper Structure (25 sections, 29 equations, 6 figures, 5 tables, 2 algorithms)

Figures (6)

  • Figure 1: Our "tight" schedules adapt to each instance's spectrum, ensuring effective noise levels at all steps. Top: An image with low energy at low frequencies. The standard cosine noise schedule destroys the signal at $t=0.5$, which means that at least half of the training steps would apply excessive noise for this input. Our adaptive schedule preserves the low-frequency content; notice that the object outline is still visible. Bottom: An image with high energy at high frequencies. The cosine schedule barely changes the input at $t=0.1$; notice that the radially averaged power spectral density (RAPSD) curves for the cosine schedule and for the input are close and correlated. This means that at least $10\%$ of the training steps would apply insufficient noise. Our schedule is effective at destroying part of the high-frequency content at this level.
  • Figure 2: Our noise schedules vary per instance based on its spectral properties. Left: Median power per frequency for ImageNet at multiple resolutions (increasing from light to dark). The power spectrum of natural images follows a power law whose trends explain current noise schedule tuning heuristics. We eschew such heuristics and use each instance's spectrum to determine its schedule. Middle: Cosine schedule and ours for 1000 images from ImageNet $256\times 256$. Right: Median noise schedules for the same set of images, at $128\times 128$, $256\times 256$, and $512\times 512$ (light to dark color). Our schedules avoid excessively high and low noise values, while following similar trends to the baseline across resolutions without any hyperparameter change.
  • Figure 3: Comparison against the SiD2 baseline on ImageNet, at different numbers of function evaluations (NFE), i.e., denoising steps. Our model outperforms the baseline at the optimal number of steps, and the gap widens as the number of steps decreases. Interestingly, our "tight" schedules exhibit a slight worsening of FID at high step counts.
  • Figure 4: Samples from ImageNet $256\times 256$. Each $2\times 4$ block shows the SiD2 baseline on top and ours on the bottom; the number of denoising steps is, from left to right, 32, 64, 128, and 256. Our generations are of noticeably higher quality at low step counts.
  • Figure 5: Manipulating the sampled spectrum to modify generated image properties. Here we modify the sampled spectrum such that the energy at the highest frequency is multiplied by factors 0.1, 0.2, 0.4, 1.0, 2.5, 5.0, and 10.0, respectively. This affects both the noise schedule and the model conditioning, so it is a way to guide the model towards different spectral properties. In this example, the energy at high frequencies correlates with the amount of texture and detail. Notice how the amount of detail increases from left to right. Images are generated by the same model, trained on ImageNet $256\times 256$, with the same initial noise.
  • ...and 1 more figure
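
The Figure 1 caption refers to the cosine schedule at $t=0.1$ and $t=0.5$. As a rough illustration of why a fixed schedule applies the same noise level to every image, the sketch below evaluates one common variance-preserving form of the cosine schedule, $\alpha_t = \cos(\pi t / 2)$, $\sigma_t = \sin(\pi t / 2)$. This parametrization is an assumption made for illustration; the baseline used in the paper may use a shifted or otherwise modified variant, and the function names here are hypothetical.

```python
import math


def cosine_schedule(t):
    """Return (alpha, sigma) for an assumed variance-preserving cosine schedule.

    alpha scales the clean image and sigma scales the Gaussian noise,
    so the noisy input is alpha * x + sigma * eps.
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must be strictly between 0 and 1")
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha, sigma


def log_snr(t):
    """Log signal-to-noise ratio, log(alpha^2 / sigma^2), at time t."""
    alpha, sigma = cosine_schedule(t)
    return 2.0 * (math.log(alpha) - math.log(sigma))


if __name__ == "__main__":
    # The noise level depends only on t, never on the image content.
    for t in (0.1, 0.5, 0.9):
        alpha, sigma = cosine_schedule(t)
        print(f"t={t:.1f}  alpha={alpha:.3f}  sigma={sigma:.3f}  logSNR={log_snr(t):+.2f}")
```

Under this assumed form, the signal and noise scales are equal at $t=0.5$ (log-SNR of zero) regardless of the input. Whether that amount of noise is excessive or insufficient for a given image depends on how its energy is distributed across frequencies, which is the dependence the per-instance schedules described in the captions are meant to capture.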