Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

Tianyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li

TL;DR

This paper proposes a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, producing more consistent timestep-conditioned generative prior guidance and achieving both state-of-the-art performance and controllable SR results with only a single step.

Abstract

Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained Stable Diffusion (SD) model for one-step SR with a fixed timestep. However, since SD exhibits different generative priors at different timesteps, a fixed timestep makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD's generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, by utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step. The source code is released at https://github.com/zty557/TADSR
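
To make the role of the timestep concrete, the following minimal sketch shows how the noise level of a diffusion input changes with $t$ under the standard forward process $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$. It is an illustration only, not code from the TADSR repository: the scaled-linear beta schedule and its endpoints (0.00085 and 0.012 over 1000 steps) are assumptions based on commonly used SD configurations, and the helper names `alphas_cumprod`, `add_noise`, and `snr` are hypothetical.

```python
import math
import random


def alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear schedule.

    The schedule type and its endpoints are assumptions, not values
    taken from the paper.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(num_steps):
        frac = i / (num_steps - 1)
        # Interpolate linearly in sqrt(beta) space, then square.
        beta = (lo + frac * (hi - lo)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(z0, t, acp, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a = acp[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return z_t, eps


def snr(t, acp):
    """Signal-to-noise ratio of z_t; it decreases as t grows."""
    return acp[t] / (1.0 - acp[t])


if __name__ == "__main__":
    acp = alphas_cumprod()
    rng = random.Random(0)
    z0 = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for a latent feature
    for t in (1, 250, 500, 999):
        z_t, _ = add_noise(z0, t, acp, rng)
        print(t, round(snr(t, acp), 4), [round(v, 3) for v in z_t])
```

Because the signal-to-noise ratio falls monotonically with $t$, a denoiser sees very different inputs at small and large timesteps, which is one way to read the abstract's claim that a single fixed timestep cannot cover all of SD's generative priors.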

Paper Structure

This paper contains 23 sections, 14 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: (a) Comparison between TADSR (ours) and PisaSR [sun2025pixel]. In PisaSR, increasing the semantic weight $\lambda_{sem}$ is intended to restore more realistic images. As the timestep condition $t$ increases, our model recovers a more realistic parrot image. In contrast, PisaSR shows only an increase in sharpness as $\lambda_{sem}$ increases. (b) The input image and the corresponding outputs of SD at different timesteps $t$. The outputs vary significantly across timesteps, reflecting distinct generative priors.
  • Figure 2: Overview of TADSR. We train a Student Model $G_\theta$ to perform one-step Real-ISR; it consists of a Time-Aware VAE Encoder $E_\theta$ and a UNet $F_\theta$. We randomly sample a timestep $t_s$ and map it to $t_v$. The timestep $t_s$ and the LQ image are fed into the encoder $E_\theta$ to obtain the LQ latent. Then, $t_s$ and the LQ latent are fed into the UNet $F_\theta$ to produce the reconstructed latent feature $\hat{z}_0$. After adding noise to $\hat{z}_0$ at the level corresponding to $t_v$, we feed the noisy latent and $t_v$ into the teacher model and the LoRA model to compute the TAVSD loss (orange flow). The reconstruction loss in pixel space (blue flow) and the TAVSD loss are then used to jointly update the student model $G_\theta$. The LoRA model is trained with the diffusion loss (green flow).
  • Figure 3: PCA visualization of latent features produced by the Time-Aware VAE Encoder (TAE) under different timesteps $t_s$, and the corresponding mean and standard deviation (Std) of the latent features. TAE encodes the same image into distinct latent features conditioned on the timestep, which matches the synchronized variation between timesteps and latent features in the pre-trained SD.
  • Figure 4: (a) Mean and standard deviation (Std) of the VSD loss at different timesteps. (b) Outputs of the teacher model and the LoRA model decoded into pixel space, together with the corresponding gradients in latent space, at different timesteps $t$.
  • Figure 5: Visual comparisons between our method and other Real-ISR methods. Please zoom in for a better view.
  • ...and 7 more figures
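
The data flow described in the Figure 2 caption can be sketched as follows. This is a toy illustration under loud assumptions, not the authors' implementation: every function is a hypothetical stand-in operating on short lists of floats, the `map_timestep` rule is a placeholder because the caption does not state how $t_s$ is mapped to $t_v$, the noise schedule is a made-up linear one, the decoder is treated as the identity, and mean squared error is assumed for both the reconstruction and diffusion losses. The VSD-style update direction is taken as the teacher prediction minus the LoRA prediction, as in standard VSD.

```python
import math
import random

NUM_STEPS = 1000


def map_timestep(t_s):
    """Hypothetical t_s -> t_v mapping (clamped identity); the real rule is not given."""
    return max(0, min(NUM_STEPS - 1, t_s))


def alpha_bar(t):
    """Toy noise schedule (placeholder): signal fraction shrinks linearly with t."""
    return 1.0 - (t + 1) / (NUM_STEPS + 1)


def encoder(lq, t_s):
    """Stand-in for E_theta: the LQ latent depends on both the image and t_s."""
    scale = 1.0 - 0.5 * t_s / (NUM_STEPS - 1)
    return [scale * x for x in lq]


def unet(z_lq, t_s):
    """Stand-in for F_theta: one step from the LQ latent to z0_hat."""
    return [z + 0.1 * t_s / (NUM_STEPS - 1) for z in z_lq]


def teacher_eps(z_t, t_v):
    """Stand-in for the frozen teacher's noise prediction."""
    return [0.9 * z for z in z_t]


def lora_eps(z_t, t_v):
    """Stand-in for the trainable LoRA model's noise prediction."""
    return [0.8 * z for z in z_t]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_signals(lq, hq, rng):
    """Compute the three training signals named in the caption for one sample."""
    t_s = rng.randrange(NUM_STEPS)           # sample the student timestep
    t_v = map_timestep(t_s)                  # map it to the teacher timestep
    z0_hat = unet(encoder(lq, t_s), t_s)     # one-step student output

    # Add noise to z0_hat at the level corresponding to t_v.
    a = alpha_bar(t_v)
    eps = [rng.gauss(0.0, 1.0) for _ in z0_hat]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0_hat, eps)]

    e_teacher = teacher_eps(z_t, t_v)
    e_lora = lora_eps(z_t, t_v)
    return {
        "t_s": t_s,
        "t_v": t_v,
        # Orange flow: VSD-style direction applied to the student output.
        "vsd_grad": [p - q for p, q in zip(e_teacher, e_lora)],
        # Blue flow: reconstruction loss (identity decoder assumed here).
        "rec_loss": mse(z0_hat, hq),
        # Green flow: diffusion loss that would train the LoRA model.
        "lora_loss": mse(e_lora, eps),
    }


if __name__ == "__main__":
    signals = training_signals([0.2, 0.4, 0.6], [0.25, 0.5, 0.75], random.Random(0))
    for name, value in signals.items():
        print(name, value)
```

The sketch only computes the signals; in an actual training loop the student would be updated from the reconstruction loss and the VSD direction, while the LoRA model would be updated separately from its diffusion loss.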