Co-learning Single-Step Diffusion Upsampler and Downsampler with Two Discriminators and Distillation

Sohwi Kim, Tae-Kyun Kim

TL;DR

This paper tackles real-world super-resolution (SR), where degradations are unknown and diverse. It introduces a co-learning framework that jointly trains a single-step diffusion-based upsampler and a learnable diffusion-based downsampler, guided by two discriminators and cyclic distillation to model both the HR and LR domains. The approach achieves state-of-the-art or competitive results on Real-ISR and FFHQ face SR, with efficient single-step inference and robust handling of real degradations. The work demonstrates that diffusion-based downsampling, coupled with adversarial guidance and distillation, can bridge the gap between synthetic and real-world SR and enable practical, high-quality SR in real-time settings.

Abstract

Super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts, often relying on effective downsampling to generate diverse and realistic training pairs. In this work, we propose a co-learning framework that jointly optimizes a single-step diffusion-based upsampler and a learnable downsampler, enhanced by two discriminators and a cyclic distillation strategy. Our learnable downsampler is designed to better capture realistic degradation patterns while preserving structural details in the LR domain, which is crucial for enhancing SR performance. By leveraging a diffusion-based approach, our model generates diverse LR-HR pairs during training, enabling robust learning across varying degradations. We demonstrate the effectiveness of our method on both general real-world and domain-specific face SR tasks, achieving state-of-the-art performance in both fidelity and perceptual quality. Our approach not only improves efficiency with a single inference step but also ensures high-quality image reconstruction, bridging the gap between synthetic and real-world SR scenarios.

Paper Structure

This paper contains 28 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Both the student network (low-to-high) and the downsampler (high-to-low) are diffusion-based architectures. In the latent space, the output of the student network is conditioned and fed into the downsampler. The output of the downsampler $\hat{y}$ is then compared with the original low-resolution $y$ during training.
  • Figure 2: The overall framework of our model. The student network $f_{\phi}$ is trained to learn a deterministic mapping from $x_T$ to $\hat{x}_0$ in just one step, guided by a pre-trained teacher network $f_\theta$. The student's output $\hat{x}_\phi(x_s,y,s)$ is then passed to the High-Resolution Discriminator $\mathcal{D}_H$. Simultaneously, $\hat{x}_\phi(x_s,y,s)$ is learnt jointly with a Learnable Downsampler $G$ and a Low-Resolution Discriminator $\mathcal{D}_L$ in an end-to-end fashion.
  • Figure 3: Visual comparisons on real-world datasets [RealSR, RealSet65]. Zoom in for more details.
  • Figure 4: Visual comparisons on the DIV2K-Val dataset. Zoom in for more details.
  • Figure 5: Comparisons on the FFHQ dataset, where 64$\times$64 inputs are upscaled to 256$\times$256 high-resolution outputs (4$\times$). Our results are compared with SinSR, a single-step diffusion-based model, and the ground truth (GT).
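
The data flow in the Figure 1 caption (low-resolution $y$ is upsampled, the result is fed to the downsampler, and the downsampler output $\hat{y}$ is compared with $y$) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: nearest-neighbour upsampling stands in for the student network, average pooling stands in for the learnable downsampler, the comparison is a plain L1 distance, and the toy works in pixel space rather than the latent space the caption mentions. None of these names or choices come from the paper.

```python
# Toy illustration of the cycle y -> upsampler -> x_hat -> downsampler -> y_hat,
# followed by a comparison of y_hat with y. Both operators are hypothetical
# stand-ins, not the diffusion networks described in the paper.

SCALE = 4  # 4x factor, matching the 64x64 -> 256x256 setting in Figure 5


def toy_upsample(lr, scale=SCALE):
    """Stand-in for the student network: nearest-neighbour upsampling."""
    hr = []
    for row in lr:
        wide = [v for v in row for _ in range(scale)]   # repeat each pixel horizontally
        hr.extend([list(wide) for _ in range(scale)])   # repeat each row vertically
    return hr


def toy_downsample(hr, scale=SCALE):
    """Stand-in for the learnable downsampler: average pooling."""
    h, w = len(hr) // scale, len(hr[0]) // scale
    lr = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [hr[i * scale + di][j * scale + dj]
                     for di in range(scale) for dj in range(scale)]
            row.append(sum(block) / len(block))
        lr.append(row)
    return lr


def l1_distance(a, b):
    """Mean absolute difference between two equally sized images."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


y = [[0.0, 0.5], [1.0, 0.25]]       # tiny 2x2 "LR image"
x_hat = toy_upsample(y)             # 8x8 "HR estimate"
y_hat = toy_downsample(x_hat)       # back to 2x2
cycle_loss = l1_distance(y_hat, y)  # discrepancy between y_hat and y
```

With these fixed stand-ins the cycle reproduces $y$ exactly, so the discrepancy is zero. In the setting the captions describe, both mappings are learned networks, so a discrepancy of this kind would be a training signal rather than a constant.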