One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

TL;DR

RSD tackles the computational bottleneck of diffusion-based image super-resolution by distilling a ResShift teacher into a one-step generator. It derives a tractable joint-distribution KL objective that uses a "fake" ResShift model to avoid backpropagating through that model's retraining, and augments the objective with LPIPS and GAN supervision in latent space to boost perceptual fidelity. Empirically, RSD achieves competitive perceptual and fidelity metrics, surpassing the teacher and rivaling state-of-the-art diffusion SR methods on Real-ISR benchmarks while using substantially fewer resources. This makes diffusion-based SR more practical for real-world deployment by delivering high-quality, fast SR on consumer-scale hardware.

Abstract

Diffusion models for super-resolution (SR) produce high-quality visual results but incur high computational costs. Although several methods have been developed to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method trains the student network to produce images such that a new fake ResShift model trained on them coincides with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method surpasses SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and less GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

Paper Structure

This paper contains 28 sections, 1 theorem, 50 equations, 12 figures, 16 tables, 2 algorithms.

Key Result

Proposition 3.1

Given a teacher model $f^*$, the loss in Eq. \ref{eq:main_loss} can be evaluated in a tractable form, Eq. \ref{eq:tractable_objective}. Here, $f_{\phi}$ is an additional ResShift model trained to optimize $\mathcal{L}_{\text{fake}}$ in Eq. \ref{eq:tractable_objective} in order to estimate $\mathcal{L}_{\theta}$. Furthermore, minimizing Eq. \ref{eq:tractable_objective} over $\phi$ is equivalent to training a "fake" ResShift on data generated by $G_\theta$.
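The last statement of the proposition, that the fake model is simply a ResShift trained on generator outputs, can be illustrated with a minimal sketch. It assumes the standard ResShift formulation, which this summary does not restate: a forward marginal $x_t \sim \mathcal{N}(x_0 + \eta_t (y_0 - x_0), \kappa^2 \eta_t I)$ and an $x_0$-prediction MSE loss. The names `forward_sample`, `fake_loss`, `toy_generator`, and `toy_fake`, the linear `etas` schedule, and the value of `kappa` are all hypothetical, and any per-timestep weighting is omitted. This is not the tractable objective of the proposition itself, only the inner training signal for $f_{\phi}$.

```python
import numpy as np

def forward_sample(x0, y0, eta_t, kappa, rng):
    """Assumed ResShift forward marginal: N(x0 + eta_t*(y0 - x0), kappa^2 * eta_t * I)."""
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * (y0 - x0) + kappa * np.sqrt(eta_t) * noise

def fake_loss(f_phi, generator, y0, etas, kappa, rng):
    """x0-prediction MSE for the fake model, with generator outputs as targets."""
    x0_hat = generator(y0, rng)          # one-step student output, treated as "data"
    t = int(rng.integers(1, len(etas)))  # hypothetical: uniform timestep in 1..T
    x_t = forward_sample(x0_hat, y0, etas[t], kappa, rng)
    pred = f_phi(x_t, y0, t)             # fake model predicts the clean sample
    return float(np.mean((pred - x0_hat) ** 2))

# Toy stand-ins (assumptions): arrays play the role of latents, lambdas the role of networks.
rng = np.random.default_rng(0)
y0 = rng.standard_normal((4, 8, 8))
etas = np.linspace(0.0, 0.99, 5)         # hypothetical shifting schedule, eta_0 = 0
toy_generator = lambda y, rng: 0.5 * y
toy_fake = lambda x_t, y, t: x_t         # untrained "fake" model: identity on x_t
loss = fake_loss(toy_fake, toy_generator, y0, etas, kappa=2.0, rng=rng)
print(loss)
```

In a real training loop the generator output would be detached before this loss is used to update $\phi$, so that the fake model only tracks the current generator distribution.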

Figures (12)

  • Figure 1: Left. A comparison between recent diffusion-based methods for Real-ISR (ResShift, SinSR, OSEDiff, SUPIR) and the proposed RSD method. RSD has the following advantages: (1) it achieves superior perceptual quality compared to SinSR; (2) it requires fewer computational resources than OSEDiff; see Table \ref{tab:effectiveness}. (The "-N" suffix after a method name is its NFE, and the value in brackets is MUSIQ$\uparrow$ for full images.) Please zoom in $5\times$ for a better view. Right. Comparison among diffusion SR methods on RealSR. RSD (Ours) achieves top scores on most metrics while remaining computationally efficient compared to T2I methods such as OSEDiff and SUPIR.
  • Figure 2: The training framework of RSD. We begin by encoding the (LR, HR) pair into the latent space as $(z_y, z_0)$. First, to compute $\mathcal{L}_{\text{LPIPS}}$, we use $z_y$ to sample $z_T$ and generate the output $\widehat{z}_0$ from timestep $T$ (following the one-step inference procedure), then decode it back to pixel space to obtain $\widehat{x}_0$. Next, we obtain $z_{t_n}$ from the forward diffusion process in latent space (Eq. \ref{eq:forward_process_resshift}) and generate $\widehat{z}_0^{t_n}$. We then perform posterior sampling (Eq. \ref{eq:resshift_posterior}) to obtain $z_t$, process it with both the fake and teacher ResShift models, and compute the distillation losses $\mathcal{L}_{\theta}$ and $\mathcal{L}_{\text{fake}}$ from Proposition 3.1. To compute $\mathcal{L}_{\text{GAN}}$, we extract features from the encoder part of the fake model $f_{\phi}$ and use an additional discriminator head.
  • Figure 3: Comparison on Real-ISR (RealSet65, yue2023resshift). Please zoom in $5\times$ for a better view.
  • Figure 4: Illustration of the distinct distribution alignment strategies employed by the RSD loss $\mathcal{L}_{\theta}$ (Ours) and the VSD loss. We denote by $p^*(x_{0:T}|y_0)$ the reverse process of the teacher ResShift model and by $p(x_{0:T}|y_0)$ the reverse process of a ResShift trained on data from the generator $G_{\theta}$. The $\mathcal{L}_{\theta}$ loss enforces alignment of the joint distributions $p^*(x_{0:T}|y_0)$ and $p(x_{0:T}|y_0)$ across all timesteps, whereas the VSD loss aligns, at each timestep $t$ separately, the marginal distributions of the teacher ResShift and of the ResShift trained on generator $G_{\theta}$ data. For formal derivations, see Eqs. \ref{eq:RSD KL loss} and \ref{eq:VSD KL loss}.
  • Figure 5: Visual results of RSD, ResShift, and SinSR models trained on $512 \times 512$ HR images from the LSDIR dataset (li2023lsdir), and of other baselines (Real-ESRGAN, BSRGAN, SUPIR, OSEDiff), on full-size images from RealSR (yue2023resshift). Please zoom in for a better view.
  • ...and 7 more figures
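The posterior sampling step mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration assuming the standard ResShift posterior, $\mathcal{N}\big(\tfrac{\eta_{t-1}}{\eta_t} z_t + \tfrac{\alpha_t}{\eta_t} \widehat{z}_0,\ \kappa^2 \tfrac{\eta_{t-1}}{\eta_t} \alpha_t I\big)$ with $\alpha_t = \eta_t - \eta_{t-1}$, which is not restated in this summary. The function name `posterior_sample`, the schedule values, and the toy latents are hypothetical, and how many such steps RSD takes between $t_n$ and $t$ is not specified here.

```python
import numpy as np

def posterior_sample(z_t, z0_hat, eta_t, eta_prev, kappa, rng):
    """One assumed ResShift posterior step from timestep t to t-1.

    z_t      : current noisy latent
    z0_hat   : predicted clean latent (e.g. a generator output)
    eta_t    : shifting coefficient at step t (must be > 0)
    eta_prev : shifting coefficient at step t-1, with eta_prev <= eta_t
    """
    alpha_t = eta_t - eta_prev
    mean = (eta_prev / eta_t) * z_t + (alpha_t / eta_t) * z0_hat
    std = kappa * np.sqrt(eta_prev / eta_t * alpha_t)
    return mean + std * rng.standard_normal(z_t.shape)

# Consistency check with noise switched off (kappa = 0): stepping back from the
# forward mean at step t should land on the forward mean at step t-1.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))    # toy clean latent
zy = rng.standard_normal((4, 8, 8))    # toy latent of the degraded input
eta_t, eta_prev = 0.6, 0.4             # hypothetical schedule values
z_t = z0 + eta_t * (zy - z0)           # noiseless forward mean at step t
z_prev = posterior_sample(z_t, z0, eta_t, eta_prev, kappa=0.0, rng=rng)
print(np.allclose(z_prev, z0 + eta_prev * (zy - z0)))
```

With a nonzero `kappa`, the same call draws a stochastic sample around that mean; the result would then be fed to both the fake and the teacher model, as the caption describes.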

Theorems & Definitions (2)

  • Proposition 3.1
  • Proof of Proposition 3.1