
AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation

Rui Xie, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Jian Yang, Ying Tai

TL;DR

This work tackles the high computational cost of diffusion-prior blind super-resolution by introducing StableSR, a framework that freezes a pre-trained diffusion model and trains a lightweight time-aware encoder with SFT to condition the prior. It adds a controllable feature wrapping module to balance realism and fidelity and employs progressive aggregation sampling to handle arbitrary output resolutions. The method achieves superior perceptual restoration on real-world datasets while significantly reducing inference time, especially with SD-Turbo sampling. Overall, StableSR demonstrates that diffusion priors can be efficiently leveraged for high-quality blind SR without full-scale retraining, offering a practical path for diffusion-based restoration in real applications.

Abstract

Blind super-resolution methods based on Stable Diffusion show formidable generative capabilities, reconstructing clear high-resolution images with intricate details from low-resolution inputs. However, their practical applicability is often hampered by poor efficiency, stemming from the need for hundreds or thousands of sampling steps. Inspired by the efficient adversarial diffusion distillation (ADD), we design AddSR to address this issue by incorporating the ideas of both distillation and ControlNet. Specifically, we first propose a prediction-based self-refinement strategy that provides high-frequency information in the student model output at marginal additional time cost. Furthermore, we refine the training process by employing HR images, rather than LR images, to regulate the teacher model, providing a more robust constraint for distillation. Second, we introduce a timestep-adaptive ADD to address the perception-distortion imbalance problem introduced by the original ADD. Extensive experiments demonstrate that AddSR generates better restoration results while running faster than previous SD-based state-of-the-art models (e.g., $7\times$ faster than SeeSR).

Paper Structure (23 sections, 8 equations, 23 figures, 7 tables, 1 algorithm)


Figures (23)

  • Figure 1: Qualitative comparisons of BSRGAN [zhang2021designing], Real-ESRGAN+ [wang2021realesrgan], FeMaSR [chen2022femasr], LDM [rombach2021highresolution], and our StableSR on real-world examples. (Zoom in for details)
  • Figure 2: Framework of StableSR. We first finetune the time-aware encoder that is attached to a fixed pre-trained Stable Diffusion model. Features are combined with trainable spatial feature transform (SFT) layers. Such a simple yet effective design is capable of leveraging the rich diffusion prior for image SR. Then, the diffusion model is fixed. Inspired by CodeFormer [zhou2022codeformer], we introduce a controllable feature wrapping (CFW) module to obtain a tuned feature $\bm{F}_m$ in a residual manner, given the additional information $\bm{F}_e$ from LR features and features $\bm{F}_d$ from the fixed decoder. With an adjustable coefficient $w$, CFW can trade off quality against fidelity.
  • Figure 3: In contrast to a conditional encoder without time embedding, one equipped with time embedding can adaptively supply guidance to the pre-trained diffusion model. (a) We gauge the cosine similarity between the diffusion model's features before and after SFT at various timesteps, which reflects the strength of the condition originating from the encoder. (b) We further visualize the features of the conditional encoder extracted from the LR image. As shown, the encoder tends to provide sharp features when the SNR is around $5 \times 10^{-2}$. This is precisely when the diffusion model requires substantial guidance to generate the desired high-resolution image content. Interestingly, this observation aligns with the findings in [choi2022perception].
  • Figure 4: When dealing with images beyond $512 \times 512$, StableSR (w/o aggregation sampling) suffers from obvious block inconsistency by chopping the image into several tiles, processing them separately, and stitching them together. With our proposed aggregation sampling, StableSR can achieve consistent results on large images. The resolution of the shown figure is $1024 \times 1024$.
  • Figure 5: Qualitative comparisons on several representative real-world samples ($128 \rightarrow 512$). Our StableSR is capable of removing artifacts and generating realistic details. (Zoom in for details)
  • ...and 18 more figures
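
The residual fusion described in the Figure 2 caption can be illustrated with a small sketch. This is only a toy reading of the caption, assuming the tuned feature takes the form F_m = F_d + w * C(F_e, F_d). The names `cfw_fuse` and `toy_transform` are hypothetical, features are flattened to plain lists of floats, and the learned layers that would implement C are replaced by a simple difference.

```python
from typing import Callable, List

Feature = List[float]  # a feature map flattened to a list, for illustration only


def cfw_fuse(
    f_e: Feature,
    f_d: Feature,
    w: float,
    transform: Callable[[Feature, Feature], Feature],
) -> Feature:
    """Residual fusion (assumed form): F_m = F_d + w * transform(F_e, F_d)."""
    if len(f_e) != len(f_d):
        raise ValueError("encoder and decoder features must have the same size")
    residual = transform(f_e, f_d)
    # Add the scaled residual back onto the fixed decoder features.
    return [d + w * r for d, r in zip(f_d, residual)]


def toy_transform(f_e: Feature, f_d: Feature) -> Feature:
    """Hypothetical stand-in for the learned layers: the gap between F_e and F_d."""
    return [e - d for e, d in zip(f_e, f_d)]


f_e = [1.0, 2.0, 3.0]  # features derived from the LR input (made-up values)
f_d = [0.0, 2.0, 5.0]  # features from the fixed decoder (made-up values)

for w in (0.0, 0.5, 1.0):
    print(w, cfw_fuse(f_e, f_d, w, toy_transform))
```

In this toy setup, `w = 0` returns the decoder features unchanged, and larger `w` moves the result toward the encoder features, which is one way to picture the adjustable coefficient trading between the two feature sources.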