ControlSR: Taming Diffusion Models for Consistent Real-World Image Super Resolution

Yuhao Wan, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Ming-Ming Cheng, Bo Li

TL;DR

ControlSR is a new method that tames diffusion models for consistent real-world image super-resolution (Real-ISR). It produces higher-quality control signals, which make the super-resolution results more consistent with the low-resolution (LR) input image and yield clearer visual results.

Abstract

We present ControlSR, a new method that can tame Diffusion Models for consistent real-world image super-resolution (Real-ISR). Previous Real-ISR models mostly focus on how to activate more generative priors of text-to-image diffusion models to make the output high-resolution (HR) images look better. However, since these methods rely too much on the generative priors, the content of the output images is often inconsistent with the input LR ones. To mitigate the above issue, in this work, we tame Diffusion Models by effectively utilizing LR information to impose stronger constraints on the control signals from ControlNet in the latent space. We show that our method can produce higher-quality control signals, which enables the super-resolution results to be more consistent with the LR image and leads to clearer visual results. In addition, we also propose an inference strategy that imposes constraints in the latent space using LR information, allowing for the simultaneous improvement of fidelity and generative ability. Experiments demonstrate that our model can achieve better performance across multiple metrics on several test sets and generate more consistent SR results with LR images than existing methods. Our code is available at https://github.com/HVision-NKU/ControlSR.

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Visual comparisons with recent state-of-the-art Real-ISR methods. Real-ESRGAN results in a lack of generated details. SeeSR uses semantic information to activate more generative priors of the SD model but results in inconsistent content with the LR image. Our results can properly generate details and have better visual effects.
  • Figure 2: Analysis of the role of the latent LR embeddings constraint. $D_{kl}$ represents the KL divergence between the control signals and latent LR embeddings. We visualize the control signals with PCA. One can observe that the control signals of ControlNet have higher $D_{kl}$ and cannot preserve the LR information well. However, our results have lower $D_{kl}$ and have sharper outlines, indicating that our model can extract LR information better. Further analysis can be seen in Section \ref{sec:further}.
  • Figure 3: Overview of our ControlSR. Our ControlSR consists of the pre-trained Stable Diffusion (SD), the Detail Preserving Module (DPM), and the Global Structure Preserving Module (GSPM). To produce high-quality control signals, we let the LR image pass through the LoRA finetuned VAE Encoder first to obtain latent LR embeddings $\mathbf{x}_{lr}$. Then, we collect the control signals $\mathbf{x}_c = \{\mathbf{c}_1, \mathbf{c}_2, \dots\}$ by inputting $\mathbf{x}_{lr}$ into the DPM and the GSPM and summing their outputs. We feed the control signals into the decoder of SD UNet to control the HR image generation.
  • Figure 4: Power spectrum visualization of the intermediate features. The two images on the left show that the cross-attention layer can increase high-frequency information, and the two images on the right show that DPM contains more high-frequency information than GSPM.
  • Figure 5: Overview of our Latent Space Adjustment strategy. a) shows the average PSNR, MANIQA, and NIQE curves of the DRealSR test set. b) demonstrates our Latent Space Adjustment strategy. c) shows the images at different steps.
  • ...and 3 more figures
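
The power spectrum comparison described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' visualization code: the function names `log_power_spectrum` and `high_frequency_ratio`, the `cutoff` parameter, and the synthetic feature maps are all assumptions made for illustration. The sketch only shows one common way to check whether a 2-D feature map carries more high-frequency energy than another.

```python
import numpy as np


def log_power_spectrum(feature: np.ndarray) -> np.ndarray:
    """Centered log power spectrum of a 2-D feature map (hypothetical helper)."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    # log1p keeps the dynamic range viewable and is defined at zero power.
    return np.log1p(np.abs(spectrum) ** 2)


def high_frequency_ratio(feature: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above `cutoff` (in cycles per sample).

    `cutoff` is an illustrative assumption; 0.25 is half the Nyquist limit.
    """
    height, width = feature.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(feature))) ** 2
    # Frequency coordinates, shifted to match the centered spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(height))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(width))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)


# Synthetic stand-ins for intermediate features (not real model outputs).
size = 64
rows, cols = np.indices((size, size))
smooth_map = np.sin(2 * np.pi * 2 * cols / size)       # 2 cycles across the map
checker_map = np.where((rows + cols) % 2 == 0, 1.0, -1.0)  # pixel-level pattern

smooth_ratio = high_frequency_ratio(smooth_map)
checker_ratio = high_frequency_ratio(checker_map)
```

Under these assumptions, a feature map with fine detail (here the checkerboard) yields a much larger ratio than a smooth one, which is the kind of difference the caption attributes to the DPM relative to the GSPM. For real features one would apply the same functions per channel, or to a channel average.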