Pyramidal Denoising Diffusion Probabilistic Models

Dohoon Ryu, Jong Chul Ye

TL;DR

This work introduces Pyramidal Denoising Diffusion Probabilistic Models (PDDPM), a diffusion framework built on a single score function conditioned on positional encodings, which can generate high-resolution images from coarse scales and perform multi-scale super-resolution. By training with scale-aware coordinates and employing pyramidal reverse sampling along with CCDF acceleration, the approach achieves substantial speedups with a light network while maintaining image quality. Ablation studies validate the importance of positional encoding and patchwise training for very high-resolution generation. Overall, PDDPM offers a practical, efficient pathway to fast diffusion-based generation and high-fidelity super-resolution using one model.

Abstract

Recently, diffusion models have demonstrated impressive image generation performance and have been extensively studied in various computer vision tasks. Unfortunately, training and evaluating diffusion models consume a lot of time and computational resources. To address this problem, we present a novel pyramidal diffusion model that can generate high-resolution images starting from much coarser-resolution images using a single score function trained with a positional embedding. This allows the neural network to be much lighter and enables time-efficient image generation without compromising performance. Furthermore, we show that the proposed approach can also be used efficiently for the multi-scale super-resolution problem using a single score function.

Paper Structure (27 sections, 10 equations, 11 figures, 4 tables, 2 algorithms)

Figures (11)

  • Figure 1: Progressive image generation from noise using the proposed method trained on the FFHQ [choi2020stargan] dataset. Images at three different resolutions are generated from noise through reverse diffusion processes using a single model. The red boxes show that semantic information is preserved across the different resolutions.
  • Figure 2: Our training scheme. Two-dimensional coordinate information is concatenated with the input image, and the result is randomly resized to one of the target resolutions. Then, the two channels of coordinate values are encoded with sine and cosine functions and expanded to $2 \times 2 \times L$ channels, where $L$ is the degree of the positional encoding.
  • Figure 3: Proposed inference procedure for (a) image generation and (b) super-resolution. At the lowest resolution, full reverse diffusion is performed; the result is then upscaled and forward diffused with additional noise. The CCDF [chung2021come] scheme is used for acceleration. For super-resolution, we impose the constraint in (\ref{eq:guide}) at every step of the reverse process.
  • Figure 4: Super-resolution results on the FFHQ and AFHQ-dog datasets. The upper row shows $\times$8 SR results and the bottom row shows $\times$4 SR results. (a) Ground truth, (b) low-resolution images, and the results of (c) cubic interpolation, (d) SRGAN, (e) SR3, (f) ILVR, and (g) the proposed method.
  • Figure 5: Images generated at full resolution (1024$\times$1024) by our method trained only on 256$\times$256 and 512$\times$512 patches. The model never saw a full-resolution image during training.
  • ...and 6 more figures
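
The coordinate encoding described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `coordinate_grid` and `positional_encoding`, the normalization of coordinates to [-1, 1], and the frequency schedule $2^k \pi$ are all assumptions made for illustration. The sketch also builds the grid directly at the requested resolution, whereas the caption describes resizing the coordinates together with the image. What it does reproduce is the channel count: two coordinate channels, each encoded with sine and cosine at $L$ frequencies, give $2 \times 2 \times L$ channels.

```python
import numpy as np


def coordinate_grid(height, width):
    """Return a (2, H, W) array holding y and x coordinates.

    Normalizing to [-1, 1] is an assumption made for this sketch.
    """
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=0)


def positional_encoding(coords, degree):
    """Encode (2, H, W) coordinates into (2 * 2 * degree, H, W) channels.

    The frequencies 2^k * pi are an assumed schedule, not taken from the paper.
    """
    freqs = (2.0 ** np.arange(degree)) * np.pi            # shape (L,)
    angles = coords[:, None, :, :] * freqs[None, :, None, None]  # (2, L, H, W)
    encoded = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (2, 2L, H, W)
    return encoded.reshape(-1, coords.shape[1], coords.shape[2])


coords = coordinate_grid(8, 8)
encoded = positional_encoding(coords, degree=6)
print(encoded.shape)  # (24, 8, 8)
```

With `degree=6`, the 2 coordinate channels become 24 channels, which would then be supplied to the network alongside the noisy image.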
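
The coarse-to-fine transition in the Figure 3 caption (upscale the generated image, then forward diffuse it with additional noise) can be sketched as follows. This is a hedged illustration only. The forward step uses the standard DDPM marginal $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$. The nearest-neighbor upsampling, the value `alpha_bar_t=0.5`, and the helper names `upsample_nearest` and `forward_diffuse` are assumptions, not choices confirmed by the paper.

```python
import numpy as np


def upsample_nearest(image, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array by an integer factor.

    The interpolation method is an assumption made for this sketch.
    """
    return image.repeat(factor, axis=1).repeat(factor, axis=2)


def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t from the DDPM marginal N(sqrt(a) * x0, (1 - a) * I), a = alpha_bar_t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise


rng = np.random.default_rng(0)

# Stand-in for an image produced by full reverse diffusion at the coarsest scale.
coarse = rng.uniform(-1.0, 1.0, size=(3, 16, 16))

# Upscale, then re-noise to an intermediate noise level (0.5 is illustrative).
upscaled = upsample_nearest(coarse, 2)
fine_init = forward_diffuse(upscaled, alpha_bar_t=0.5, rng=rng)
```

In the procedure the caption describes, reverse diffusion at the finer resolution would then start from a re-noised image like `fine_init` rather than from pure noise, which is where the shortened sampling schedule comes from.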