TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou

TL;DR

TSD-SR addresses Real-ISR by distilling a pre-trained diffusion prior into a fast one-step model that preserves realistic textures. It introduces Target Score Distillation (TSD), which combines Target Score Matching with the diffusion prior to provide stable, HQ-guided gradients, and a Distribution-Aware Sampling Module (DASM) to emphasize early timesteps for detail recovery. The approach yields superior perceptual restoration with substantially faster inference than prior diffusion-based methods, and ablations confirm the effectiveness of both TSM and DASM. This work advances practical Real-ISR by delivering high-quality results suitable for real-world deployment and sets the stage for future efficiency improvements via model compression.

Abstract

Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Because diffusion models refine their output iteratively, most existing approaches are computationally expensive. Although methods such as SinSR and OSEDiff reduce the number of inference steps via distillation, their image restoration and detail recovery remain unsatisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. First, we introduce Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Second, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that TSD-SR achieves superior restoration results (it performs best on most metrics) and the fastest inference speed (e.g., 40 times faster than SeeSR) compared with previous Real-ISR approaches based on pre-trained diffusion priors.

Paper Structure

This paper contains 20 sections, 12 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Performance and efficiency comparison among Real-ISR methods. TSD-SR stands out for achieving high-quality restoration with the fastest speed among diffusion-based models. In contrast, existing models prioritize either speed or restoration performance. The performance of each method is benchmarked on an A100 GPU with the DRealSR dataset.
  • Figure 2: Pipeline overview. We train a one-step Student Model $G_\theta$ to transform the low-quality image $x_L$ into a more realistic one. The noisy latent $\boldsymbol{\hat{z}_t}$ sampled by DASM (detailed in the DASM figure) is fed into both the pre-trained Teacher and the LoRA Model to produce the Variational Score Loss. Subsequently, the Teacher's predictions on $\boldsymbol{\hat{z}_t}$ and $\boldsymbol{z_t}$ yield the Target Score Loss. Their weighted forms, namely TSD (red flow), along with the pixel-space reconstruction loss (green flow), are leveraged to update the Student Model $G_\theta$. After updating the Student Model, we employ the diffusion loss (blue flow) to update the LoRA Model.
  • Figure 3: A visual comparison of the gradient direction. We set the timestep $t$ to 100 and calculated the cosine similarity between the prediction directions from the Teacher Model and the true direction (towards the HQ data). The prediction direction for $\boldsymbol{z_t}$ closely matches the true direction, but not for $\boldsymbol{\hat{z}_t}$, suggesting that suboptimal samples may lead to directional deviations.
  • Figure 4: Visualization of different strategies. (a) The naive method introduces fake textures and fails to recover fine details. (b) MSE leads to over-smoothed results that lack high-frequency information. (c) Our method offers superior visual quality and fine textures.
  • Figure 5: (a) The prediction errors of the VSD loss at different timesteps. The error divergence is more pronounced in early timesteps than later. This phenomenon is observed throughout the optimization process. (b) The visualization of Stage 1 prediction error. (c) The visualization of Stage 2 prediction error.
  • ...and 8 more figures
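
The Figure 3 caption compares directions by cosine similarity. The sketch below shows only that measurement on made-up toy vectors; it is an illustration under stated assumptions, not the authors' code. The helper names (`cosine_similarity`, `direction`) and all numeric values are hypothetical, latents are treated as flattened lists, and the "true direction" is assumed to be the difference between an HQ latent and the noisy latent.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened direction vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def direction(start: Sequence[float], end: Sequence[float]) -> List[float]:
    """Element-wise difference end - start (assumed 'true direction')."""
    return [e - s for s, e in zip(start, end)]


# Toy 4-dimensional stand-ins for latents (values are invented).
z_hq = [1.0, 2.0, 0.0, -1.0]   # hypothetical high-quality latent
z_t = [0.0, 0.0, 0.0, 0.0]     # hypothetical noisy latent
true_dir = direction(z_t, z_hq)

# Two hypothetical teacher predictions: one parallel to the true
# direction, one orthogonal to it.
pred_aligned = [2.0, 4.0, 0.0, -2.0]
pred_deviated = [2.0, -1.0, 3.0, 0.0]

sim_aligned = cosine_similarity(true_dir, pred_aligned)
sim_deviated = cosine_similarity(true_dir, pred_deviated)
print(f"aligned:  {sim_aligned:.3f}")
print(f"deviated: {sim_deviated:.3f}")
```

A value near 1 means the predicted direction points toward the HQ data, while a value near 0 indicates the kind of directional deviation the caption attributes to suboptimal samples.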