Waving Goodbye to Low-Res: A Diffusion-Wavelet Approach for Image Super-Resolution

Brian Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel

TL;DR

By operating in the frequency domain, the diffusion model effectively hallucinates high-frequency information for SR images on the wavelet spectrum, resulting in high-quality and detailed reconstructions in image space.

Abstract

This paper presents a novel Diffusion-Wavelet (DiWa) approach for Single-Image Super-Resolution (SISR). It leverages the strengths of Denoising Diffusion Probabilistic Models (DDPMs) and Discrete Wavelet Transformation (DWT). By enabling DDPMs to operate in the DWT domain, our DDPM models effectively hallucinate high-frequency information for super-resolved images on the wavelet spectrum, resulting in high-quality and detailed reconstructions in image space. Quantitatively, we outperform state-of-the-art diffusion-based SISR methods, namely SR3 and SRDiff, regarding PSNR, SSIM, and LPIPS on both face (8x scaling) and general (4x scaling) SR benchmarks. Meanwhile, using DWT enabled us to use fewer parameters than the compared models: 92M parameters instead of 550M compared to SR3 and 9.3M instead of 12M compared to SRDiff. Additionally, our method outperforms other state-of-the-art generative methods on classical general SR datasets while saving inference time. Finally, our work highlights its potential for various applications.

Paper Structure (19 sections, 9 equations, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: Overview of training. The diffusion process takes the difference between the initial predictor and the corresponding HR image as input. The trained reverse process learns to denoise the noisy residual image with the difference between the reconstruction of the initial predictor and the corresponding HR image as the optimization target.
  • Figure 2: Overview of inference. The image is first decomposed into sub-bands using 2D-DWT, which an initial predictor processes. The denoise function then adds and computes the remaining details by conditioning on the sub-bands, incorporating noise. The result is then returned to the pixel domain via the inverse 2D-DWT function.
  • Figure 3: A comparison of LR, SR, and HR images (CelebA-HQ) illustrates the quality of our proposed method for the $16\times16 \rightarrow 128\times128$ setting. The LR image shows a significant loss of information, most notably of the finger in front of the mouth. Our proposed method can reconstruct the image with great detail, particularly in the hair. However, the HR image shows that our method cannot reconstruct the finger. The HR image also shows more defined edges and sharper details in the eyes.
  • Figure 4: LR, HR, and SR (our method) example results of the $64\times64 \rightarrow 512\times512$ experiments for three different face images (CelebA-HQ). The last row shows that our model produces continuous skin texture, which does not match the small details of the ground truth, such as moles and pimples.
  • Figure 5: Intermediate denoising results obtained with our approach on face super-resolution ($64\times64 \rightarrow 512\times512$). The top left image is the LR input. The middle image in the first row is the estimate of our initial predictor. The remaining images show the intermediate estimates of our denoising function as it is applied iteratively, progressing from left to right and top to bottom. The final prediction of the denoising function is in the lower right corner of the grid.
  • ...and 2 more figures
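
The residual formulation described in the captions of Figures 1 and 2 can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single-level Haar wavelet (the captions only say 2D-DWT), uses pure-Python lists instead of a tensor library, and replaces the learned initial predictor with a hypothetical stand-in that gets the low-frequency band right and predicts no detail. The helper names (`haar_dwt2`, `haar_idwt2`, `combine`) are illustrative only, and sub-band naming conventions vary between libraries.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT: four half-resolution sub-bands (LL, LH, HL, HH)."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    ll, lh, hl, hh = ([[0.0] * cols for _ in range(rows)] for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # scaled local sum (low frequency)
            lh[i][j] = (a - b + c - d) / 2  # left column minus right column
            hl[i][j] = (a + b - c - d) / 2  # top row minus bottom row
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution image."""
    rows, cols = len(ll), len(ll[0])
    img = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return img


def combine(bands_a, bands_b, sign):
    """Element-wise bands_a + sign * bands_b over a tuple of sub-bands."""
    return tuple(
        [[p + sign * q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
        for a, b in zip(bands_a, bands_b)
    )


# Toy 4x4 "HR image".
hr = [[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0],
      [13.0, 14.0, 15.0, 16.0]]
hr_bands = haar_dwt2(hr)

# Hypothetical stand-in for the initial predictor (assumption, not the paper's model):
# it reproduces the LL band and predicts zero for every detail band.
zeros = [[0.0, 0.0], [0.0, 0.0]]
initial = (hr_bands[0], zeros, zeros, zeros)

# Training side (Figure 1): the target is HR minus the initial prediction,
# computed in the wavelet domain.
residual = combine(hr_bands, initial, -1)

# Inference side (Figure 2): add the (here: perfect) residual to the initial
# prediction, then return to pixel space with the inverse transform.
restored = haar_idwt2(*combine(initial, residual, +1))
```

In this toy setting the residual is known exactly, so `restored` matches `hr`. In the described method the residual would instead be produced by the iterative denoising function conditioned on the sub-bands. Note also that each sub-band has half the height and width of the image, which is consistent with the abstract's point that working in the DWT domain allows for smaller models.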