One Step Diffusion-based Super-Resolution with Time-Aware Distillation

Xiao He, Huaao Tang, Zhijun Tu, Junchao Zhang, Kun Cheng, Hanting Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu

TL;DR

This work tackles the latency of diffusion-based super-resolution by introducing time-aware diffusion distillation (TAD-SR), enabling high-quality SR in a single sampling step. It combines a teacher-student distillation with a high-frequency enhanced score distillation (HFSD) and a time-aware latent discriminator to guide the student toward real-image manifolds under mild perturbations. The approach yields comparable or superior performance to multi-step teacher models on both synthetic and real-world SR tasks, with notable gains in perceptual quality as shown by non-reference metrics. This method promises practical speedups for diffusion-based SR in real-world imaging applications, including blind face restoration, while maintaining high fidelity to the original content.

Abstract

Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts. However, these approaches typically require tens or even hundreds of iterative sampling steps, resulting in significant latency. Recently, techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation. Nonetheless, when aligning the knowledge of student and teacher models, these solutions either rely solely on pixel-level loss constraints or neglect the fact that diffusion models prioritize varying levels of information at different time steps. To accomplish effective and efficient image super-resolution, we propose a time-aware diffusion distillation method, named TAD-SR. Specifically, we introduce a novel score distillation strategy to align the data distribution between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy enables the student network to concentrate more on high-frequency details. Furthermore, to mitigate performance limitations stemming from distillation, we integrate a latent adversarial loss and devise a time-aware discriminator that leverages diffusion priors to effectively distinguish between real images and generated images. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method achieves performance comparable or even superior to both previous state-of-the-art (SOTA) methods and the teacher model in just one sampling step. Code is available at https://github.com/LearningHx/TAD-SR.
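
As a rough illustration of comparing student and teacher outputs "after minor noise perturbation", the following toy sketch perturbs both latents with the same small amount of noise, passes them through a stand-in denoiser, and measures the gap between the two predictions. Everything here is an assumption made for illustration: the DDPM-style variance-preserving perturbation, the function names (`perturb`, `toy_denoiser`, `perturbed_alignment_loss`), and the linear stand-in denoiser are hypothetical and are not taken from the paper or its code release. It is not the HFSD loss itself.

```python
import numpy as np

def perturb(z0, alpha_bar_t, eps):
    """Variance-preserving forward perturbation (DDPM-style; an assumption,
    not necessarily the noise schedule used by the paper's teacher model)."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_denoiser(z_t, alpha_bar_t):
    """Hypothetical stand-in for a frozen pre-trained model's clean-latent
    prediction. A real model would be a neural network."""
    return z_t / np.sqrt(alpha_bar_t)

def perturbed_alignment_loss(z_student, z_teacher, alpha_bar_t, rng):
    """Mean squared gap between denoiser predictions for the student and
    teacher latents, both perturbed with the same noise sample."""
    eps = rng.standard_normal(z_student.shape)  # shared noise for both branches
    pred_student = toy_denoiser(perturb(z_student, alpha_bar_t, eps), alpha_bar_t)
    pred_teacher = toy_denoiser(perturb(z_teacher, alpha_bar_t, eps), alpha_bar_t)
    return float(np.mean((pred_student - pred_teacher) ** 2))

rng = np.random.default_rng(0)
z_teacher = rng.standard_normal((4, 8, 8))                    # toy teacher latent
z_student = z_teacher + 0.1 * rng.standard_normal((4, 8, 8))  # slightly-off student latent

# alpha_bar_t close to 1 corresponds to a small ("minor") perturbation.
loss = perturbed_alignment_loss(z_student, z_teacher, alpha_bar_t=0.98, rng=rng)
```

Because the stand-in denoiser is linear and the noise is shared, the loss here reduces to the plain squared difference between the two latents; with a real, nonlinear denoiser the comparison would depend on the chosen noise level.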

Paper Structure (20 sections, 13 equations, 13 figures, 8 tables, 1 algorithm)

Figures (13)

  • Figure 1: Qualitative comparisons on one typical real-world example between the proposed method and recent state-of-the-art methods, including BSRGAN zhang2021designing, RealESRGAN wang2021real, SwinIR liang2021swinir, DASR liang2022efficient, LDM rombach2022high, StableSR wang2023exploiting, ResShift yue2024resshift, and SinSR wang2023sinsr. For more intuitive visualization, we mark the number of sampling steps of each diffusion-based SR method using the format “Method-A”, where “A” is the number of sampling steps. Note that LDM uses 1000 diffusion steps in training and is accelerated to “A” steps using DDIM song2020denoising during inference. Please zoom in for a better view.
  • Figure 2: We inject varying degrees of noise into the outputs of the student model, the teacher model, and the corresponding real data (HR images). These noisy data are then fed into a pre-trained diffusion-based SR model to obtain their denoising scores (visualized as the clean data prediction $\hat{z}_0$).
  • Figure 3: Method overview. We train our student model to map a noisy latent $z_T$ to a clean latent $z_0^{stu}$ through single-step sampling. To match the student model's output with the multi-step sampling outputs of the teacher model, we optimize the student model using both the vanilla distillation loss and our proposed high-frequency enhanced score distillation. Additionally, to further improve the performance of the student model, we propose a time-aware discriminator that provides effective supervision through generative adversarial training.
  • Figure 4: SDS poole2022dreamfusion
  • Figure 5: DMD yin2023one
  • ...and 8 more figures