Efficient Diffusion Model for Image Restoration by Residual Shifting

Zongsheng Yue, Jianyi Wang, Chen Change Loy

TL;DR

This study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps, and establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency.

Abstract

While diffusion-based image restoration (IR) methods have achieved remarkable success, they are still limited by low inference speed, because they must execute hundreds or even thousands of sampling steps. Existing accelerated sampling techniques, though seeking to expedite the process, inevitably sacrifice performance to some extent, resulting in over-blurry restored outcomes. To address this issue, this study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps. Our method removes the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Specifically, our proposed method establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency. A carefully formulated noise schedule is devised to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experimental evaluations demonstrate that the proposed method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks, namely image super-resolution, image inpainting, and blind face restoration, even with only four sampling steps. Our code and model are publicly available at https://github.com/zsyOAOA/ResShift.
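
The residual-shifting idea described in the abstract can be illustrated with a small NumPy sketch. It assumes a Gaussian marginal of the form x_t = x_0 + eta_t * (y_0 - x_0) + kappa * sqrt(eta_t) * noise, where x_0 is the high-quality image, y_0 the low-quality image, and eta_t a shifting coefficient that grows from near 0 to near 1 over the chain. This marginal, the geometric placeholder schedule, and every name below (`shift_schedule`, `diffuse`, `eta_1`, `eta_T`) are illustrative assumptions that this summary does not state; the authors' actual schedule should be taken from the paper or the released code.

```python
import numpy as np

def shift_schedule(T, eta_1=1e-3, eta_T=0.999):
    """Placeholder schedule (assumed, not the paper's): eta_t grows
    geometrically from eta_1 to eta_T over T steps."""
    return np.geomspace(eta_1, eta_T, num=T)

def diffuse(x0, y0, eta_t, kappa, rng):
    """Sample x_t under the assumed marginal:
    x_t = x0 + eta_t * (y0 - x0) + kappa * sqrt(eta_t) * noise."""
    residual = y0 - x0                      # LQ minus HQ
    noise = rng.standard_normal(x0.shape)   # standard Gaussian noise
    return x0 + eta_t * residual + kappa * np.sqrt(eta_t) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))    # stand-in for an HQ image
y0 = np.zeros_like(x0)                      # stand-in for an LQ image of the same size
etas = shift_schedule(T=4)                  # four steps, echoing the abstract

# With kappa = 0 the noise term vanishes, which isolates the shifting behaviour.
x_first = diffuse(x0, y0, etas[0], kappa=0.0, rng=rng)
x_last = diffuse(x0, y0, etas[-1], kappa=0.0, rng=rng)
```

In this sketch the first state stays close to the HQ stand-in and the last state lands close to the LQ stand-in, which is why a chain built this way could start sampling from the LQ image rather than from pure noise. A positive `kappa` would add noise whose strength grows with `eta_t`.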

Paper Structure (30 sections, 21 equations, 25 figures, 13 tables, 2 algorithms)

Figures (25)

  • Figure 1: Qualitative comparisons on one typical real-world example between the proposed method and recent state-of-the-art methods, including RealESRGAN wang2021real, BSRGAN zhang2021designing, SwinIR liang2021swinir, LDM rombach2022high, StableSR wang2023exploiting, and CCSR sun2023improving. For the diffusion-based approaches and our proposed method, we annotate the number of sampling steps in the format "Method-A", where "A" denotes the number of sampling steps.
  • Figure 2: Overview of the proposed method. Our method builds up a Markov chain between the HQ/LQ image pair by shifting their residuals. To alleviate the computational burden of this transition, it can be optionally moved to the latent space of VQGAN esser2021taming.
  • Figure 3: Illustration of the proposed noise schedule. (a) HQ image. (b) Zoomed LQ image. (c)-(d) Diffused images of the proposed noise schedule in timesteps of 1, 3, 5, 7, 9, 12, and 15 under different values of $\kappa$ by fixing $p=0.3$ and $T=15$. (e)-(f) Diffused images of our method with a specified configuration of $\kappa=40, p=0.8, T=1000$ and LDM rombach2022high in timesteps of 100, 200, 400, 600, 800, 900, and 1000. (g) The relative noise intensity (vertical axes, measured by $\sqrt{1/\lambda_{\text{snr}}}$, where $\lambda_{\text{snr}}$ denotes the signal-to-noise ratio) of the schedules in (d) and (e) w.r.t. the timesteps (horizontal axes). (h) The shifting speed $\sqrt{\eta_t}$ (vertical axes) w.r.t. the timesteps (horizontal axes) across various configurations of $p$. Note that the diffusion processes in this figure are implemented in the latent space, but we display the intermediate results after decoding back to the image space for ease of visualization.
  • Figure 4: Visual comparison of two different models containing some self-attention layers (denoted as model-1) or Swin Transformers (denoted as model-2). (a1) and (a2): zoomed LQ images with resolutions of $64\times 64$ or $128\times 128$. (b1) and (b2): super-resolved results by model-1. (c1) and (c2): visualized attention maps extracted from the first self-attention layer of model-1. Note that these visualized results are obtained by first calculating the first principal component of the attention map via PCA and then reshaping it to the target size. In the upper-left corner, we annotate the entropy value of these attention maps. (d1) and (d2): super-resolved results by model-2.
  • Figure 5: Ablation studies of our method regarding the perceptual loss.
  • ...and 20 more figures