Exploiting Diffusion Prior for Real-World Image Super-Resolution

Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy

TL;DR

This work introduces StableSR, a practical framework that exploits the diffusion prior for real-world blind super-resolution without retraining large diffusion models. It achieves this by fine-tuning a lightweight time-aware encoder and attaching a controllable feature wrapping module, while freezing the diffusion backbone to preserve generative priors. A progressive aggregation sampling strategy enables SR on arbitrary image sizes, and inference-time strategies like classifier-free guidance and SD-Turbo further boost quality and speed. Across synthetic and real-world benchmarks, StableSR delivers superior perceptual quality and texture fidelity, with ablations confirming the critical roles of the time-aware conditioning, fidelity-realism trade-off, and aggregation sampling. The approach offers a scalable, efficient path to high-quality SR in practical settings, with open-source code and models provided.

Abstract

We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.
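The scalar quality-fidelity control can be illustrated with a small sketch. It assumes the residual form suggested by the Figure 2 caption, $\bm{F}_m = \bm{F}_d + w \cdot \mathcal{C}(\bm{F}_e, \bm{F}_d)$, where $\mathcal{C}$ stands for the trainable CFW layers. The function `toy_correction` below is a hypothetical stand-in for those layers (it simply returns $\bm{F}_e - \bm{F}_d$), and features are flattened into plain lists; neither is taken from the released code.

```python
from typing import Callable, List

Feature = List[float]  # a flattened feature map, used here only for illustration


def toy_correction(f_e: Feature, f_d: Feature) -> Feature:
    """Hypothetical stand-in for the trained CFW layers: points from F_d toward F_e."""
    return [e - d for e, d in zip(f_e, f_d)]


def controllable_feature_wrapping(
    f_e: Feature,
    f_d: Feature,
    w: float,
    correction: Callable[[Feature, Feature], Feature] = toy_correction,
) -> Feature:
    """Assumed residual form: F_m = F_d + w * C(F_e, F_d)."""
    if len(f_e) != len(f_d):
        raise ValueError("F_e and F_d must have the same shape")
    delta = correction(f_e, f_d)
    return [d + w * c for d, c in zip(f_d, delta)]


# Decoder features (from the fixed decoder) and features derived from the LR input.
f_d = [1.0, 2.0, 3.0]
f_e = [2.0, 2.0, 1.0]

for w in (0.0, 0.5, 1.0):
    print(w, controllable_feature_wrapping(f_e, f_d, w))
```

With `w = 0` the decoder features pass through unchanged, so the output relies entirely on the generative prior; increasing `w` adds more of the LR-derived correction. With this toy correction, `w = 1` reproduces `f_e` exactly, which would not generally hold for a learned module.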

Paper Structure

This paper contains 23 sections, 8 equations, 23 figures, 7 tables, 1 algorithm.

Figures (23)

  • Figure 1: Qualitative comparisons of BSRGAN zhang2021designing, Real-ESRGAN+ wang2021realesrgan, FeMaSR chen2022femasr, LDM rombach2021highresolution, and our StableSR on real-world examples. (Zoom in for details)
  • Figure 2: Framework of StableSR. We first finetune the time-aware encoder, which is attached to a fixed pre-trained Stable Diffusion model. Its features are combined with those of the diffusion model through trainable spatial feature transform (SFT) layers. This simple yet effective design leverages the rich diffusion prior for image SR. Then, with the diffusion model still fixed and inspired by CodeFormer zhou2022codeformer, we introduce a controllable feature wrapping (CFW) module that obtains a tuned feature $\bm{F}_m$ in a residual manner, given the additional information $\bm{F}_e$ from LR features and the features $\bm{F}_d$ from the fixed decoder. With an adjustable coefficient $w$, CFW can trade off quality against fidelity.
  • Figure 3: In contrast to a conditional encoder without time embedding, one equipped with time embedding can adaptively supply guidance to the pre-trained diffusion model. (a) We measure the cosine similarity between the diffusion model's features before and after SFT at various timesteps, which reflects the strength of the condition originating from the encoder. (b) We further visualize the features that the conditional encoder extracts from the LR image. As shown, the encoder tends to provide sharp features when the signal-to-noise ratio (SNR) is around $5 \times 10^{-2}$, which is precisely when the diffusion model requires substantial guidance to generate the desired high-resolution image content. Interestingly, this observation aligns with the findings in choi2022perception.
  • Figure 4: For images larger than $512 \times 512$, StableSR without aggregation sampling must chop the image into several tiles, process them separately, and stitch them together, which produces obvious inconsistency between blocks. With our proposed aggregation sampling, StableSR achieves consistent results on large images. The resolution of the shown figure is $1024 \times 1024$.
  • Figure 5: Qualitative comparisons on several representative real-world samples ($128 \rightarrow 512$). Our StableSR is capable of removing artifacts and generating realistic details. (Zoom in for details)
  • ...and 18 more figures
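
The tile-fusion idea behind aggregation sampling (Figure 4) can be sketched in one dimension. The sketch assumes that overlapping tiles are processed independently and then blended with Gaussian weights that favor each tile's center, and that in a real sampler this fusion would be applied to 2D latents at every denoising step. The helper names, the tile and stride values, and the `sigma_scale` parameter are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Callable, List


def gaussian_weights(size: int, sigma_scale: float = 0.25) -> List[float]:
    """Weights that peak at the tile center and decay toward its borders."""
    center = (size - 1) / 2.0
    sigma = max(size * sigma_scale, 1e-6)
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(size)]


def tile_starts(length: int, tile: int, stride: int) -> List[int]:
    """Start indices of overlapping tiles that together cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # align the last tile with the right edge
    return starts


def aggregate_step(
    signal: List[float],
    process_tile: Callable[[List[float]], List[float]],
    tile: int = 4,
    stride: int = 2,
) -> List[float]:
    """Process overlapping tiles separately, then fuse them by weighted averaging."""
    if not 0 < stride <= tile:
        raise ValueError("stride must be in (0, tile] so that tiles cover the signal")
    size = min(tile, len(signal))
    weights = gaussian_weights(size)
    acc = [0.0] * len(signal)   # weighted sum of tile outputs per position
    norm = [0.0] * len(signal)  # sum of weights per position
    for start in tile_starts(len(signal), tile, stride):
        out = process_tile(signal[start:start + size])
        for i, (value, weight) in enumerate(zip(out, weights)):
            acc[start + i] += weight * value
            norm[start + i] += weight
    return [a / n for a, n in zip(acc, norm)]


# Sanity check: if each tile is returned unchanged, fusion reproduces the input.
latent = [float(i) for i in range(10)]
fused = aggregate_step(latent, process_tile=lambda t: list(t))
print(all(abs(a - b) < 1e-9 for a, b in zip(fused, latent)))
```

Because every position is normalized by the sum of the weights it received, positions covered by several tiles get a smooth blend of their outputs rather than a hard seam, which is the property the stitched baseline in Figure 4 lacks.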