ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution

Shuyao Shang, Zhengyang Shan, Guangxing Liu, LunQian Wang, XingHua Wang, Zekai Zhang, Jinglin Zhang

TL;DR

ResDiff addresses the inefficiency of diffusion-based SISR by pairing a lightweight pre-trained CNN, which recovers the main low-frequency content, with a diffusion model that learns the residual between the ground-truth image and the CNN prediction. It adds a frequency-domain strategy on both sides: a CNN loss with FFT and DWT components, and a frequency-guided diffusion built from an FD Info Splitter and HF-guided Cross-Attention that emphasizes high-frequency details. Experiments on FFHQ, CelebA, DIV2K, and Urban100 show faster convergence and better sample quality than previous diffusion-based methods, with higher PSNR/SSIM, lower FID, and more diverse outputs. The approach offers a practical route to efficient, high-fidelity SISR and can be extended to other restoration tasks; future work focuses on computational optimization and color consistency.

Abstract

Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for Single Image Super-Resolution (SISR). ResDiff combines a CNN, which restores the primary low-frequency components, with a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to common diffusion-based methods that directly use LR images to guide the noise towards HR space, ResDiff uses the CNN's initial prediction to direct the noise towards the residual space between HR space and CNN-predicted space, which not only accelerates the generation process but also yields superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain guided diffusion is designed to help the DPM predict high-frequency details. Extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.
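The residual structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `cnn_stand_in` is a hypothetical placeholder (nearest-neighbour upsampling) for the pre-trained CNN, the average-pool degradation is an assumption, and the diffusion model is replaced by the exact residual so that only the decomposition itself is shown.

```python
import numpy as np


def cnn_stand_in(lr: np.ndarray, scale: int) -> np.ndarray:
    """Hypothetical placeholder for the pre-trained CNN (assumption):
    nearest-neighbour upsampling, which recovers only coarse content."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)


def residual_target(hr: np.ndarray, lr: np.ndarray, scale: int):
    """Return the CNN prediction and the residual a DPM would learn to generate."""
    cnn_pred = cnn_stand_in(lr, scale)
    return cnn_pred, hr - cnn_pred


def compose(cnn_pred: np.ndarray, residual_hat: np.ndarray) -> np.ndarray:
    """Final super-resolved image = CNN prediction + predicted residual."""
    return cnn_pred + residual_hat


rng = np.random.default_rng(0)
hr = rng.random((8, 8))                           # toy ground-truth HR image
lr = hr.reshape(4, 2, 4, 2).mean(axis=(1, 3))     # assumed degradation: 2x average pooling
cnn_pred, residual = residual_target(hr, lr, scale=2)

# With a perfect residual prediction, the composition recovers the HR image.
sr = compose(cnn_pred, residual)
```

In this toy setup the placeholder already carries each block's mean, so the residual holds only the within-block detail, which is the part the diffusion model is asked to generate.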

Paper Structure

This paper contains 17 sections, 15 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overall structure of the proposed ResDiff.
  • Figure 2: Comparison of different generation processes. In contrast to (a) SR3, (b) SRDiff, and (c) deblur, where only LR Space is used to guide the generation, our ResDiff (d) makes full use of CNN Prediction Space and High-Frequency Space to guide a faster and better generation.
  • Figure 3: Depiction of the three loss functions utilized in CNN pre-training. A spatial domain loss (GT Loss) and two frequency domain losses (FFT Loss and DWT Loss) are computed.
  • Figure 4: An overview of the model architecture in the proposed FD-guided diffusion. The pre-trained CNN prediction and the noisy image $x_t$ from step $t$ are fed into the FD Info Splitter, and its output is then passed on to a U-Net, which is equipped with HF-guided cross-attention.
  • Figure 5: DIV2K 4$\times$ results. Note that ResDiff provides richer details and more natural textures than other diffusion-based methods for the recovery of small objects (e.g., the clock in the first column) and difficult scenes (e.g., the bridge structure in the second column, the building in the fourth column).
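
The three CNN pre-training losses named in the Figure 3 caption (GT Loss, FFT Loss, DWT Loss) could be combined roughly as in the NumPy sketch below. The exact formulation is not given in this summary, so the L1 distances, the one-level Haar wavelet, and the weights `w_fft` and `w_dwt` are all assumptions made for illustration.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """One-level orthonormal 2D Haar transform of an even-sized image.
    Returns the low-pass band followed by three detail bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2
    detail_1 = (a - b + c - d) / 2   # differences across columns
    detail_2 = (a + b - c - d) / 2   # differences across rows
    detail_3 = (a - b - c + d) / 2   # diagonal differences
    return low, detail_1, detail_2, detail_3


def cnn_pretrain_loss(pred, gt, w_fft=1.0, w_dwt=1.0):
    """Spatial loss plus two frequency-domain terms (assumed L1 form and weights)."""
    gt_loss = np.mean(np.abs(pred - gt))
    fft_loss = np.mean(np.abs(np.fft.fft2(pred) - np.fft.fft2(gt)))
    dwt_loss = sum(
        np.mean(np.abs(p - g)) for p, g in zip(haar_dwt2(pred), haar_dwt2(gt))
    )
    return gt_loss + w_fft * fft_loss + w_dwt * dwt_loss


rng = np.random.default_rng(0)
gt = rng.random((8, 8))
pred = gt + 0.1 * rng.standard_normal((8, 8))   # toy CNN output with small errors
loss_value = cnn_pretrain_loss(pred, gt)
```

A real training loop would compute the same terms on framework tensors so that gradients can flow; NumPy is used here only to keep the sketch self-contained.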