Data Upcycling Knowledge Distillation for Image Super-Resolution

Yun Zhang; Wei Li; Simiao Li; Hanting Chen; Zhijun Tu; Wenjia Wang; Bingyi Jing; Shaohui Lin; Jie Hu

TL;DR

Data Upcycling Knowledge Distillation (DUKD) targets a core limitation of KD for SR: teacher outputs are noisy approximations to the ground-truth distribution, which weakens the transfer of knowledge to the student. DUKD introduces two components, in-domain data upcycling and label consistency regularization, which augment KD with upcycled data and invertible perturbations. Training is guided by the total loss $\mathcal{L} = \mathcal{L}_{rec} + \lambda_{kd}\mathcal{L}_{kd} + \lambda_{dukd}\mathcal{L}_{dukd}$. By supervising on upcycled LR/HR pairs and enforcing outputs that stay consistent once an invertible transform is undone with $\mathcal{F}^{-1}$, DUKD achieves stronger performance than existing KD methods across SR backbones (e.g., EDSR, RCAN, SwinIR) and scales, including cross-architecture teacher-student pairs and real-world SR benchmarks. This data-centric KD approach reduces dependence on GT-aligned teacher signals, improves robustness via label consistency, and offers a practical path to deploying high-quality SR models on resource-constrained devices. Code is released for reproducibility.
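The following is a minimal sketch of how the three terms of the total loss could be combined, not the authors' implementation. It assumes an L1 distance for every term, a horizontal flip as the invertible transform (so $\mathcal{F}^{-1} = \mathcal{F}$), and toy images stored as nested lists. The names `hflip`, `l1`, `upscale2x`, `total_loss`, and `lr_up` are hypothetical, and `upscale2x` only stands in for an SR network.

```python
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image as nested lists (assumption)


def hflip(img: Image) -> Image:
    """Horizontal flip; it is its own inverse, so F^{-1} = F."""
    return [row[::-1] for row in img]


def l1(a: Image, b: Image) -> float:
    """Mean absolute error between two images of the same size."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def upscale2x(img: Image) -> Image:
    """Nearest-neighbour x2 upscaling, a stand-in for an SR network."""
    out: Image = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def total_loss(
    student: Callable[[Image], Image],
    teacher: Callable[[Image], Image],
    lr: Image,      # original low-resolution input
    hr: Image,      # its ground-truth high-resolution target
    lr_up: Image,   # an upcycled in-domain LR input (no GT needed)
    lam_kd: float = 1.0,
    lam_dukd: float = 1.0,
) -> float:
    """L = L_rec + lam_kd * L_kd + lam_dukd * L_dukd (L1 is an assumption)."""
    rec = l1(student(lr), hr)           # reconstruction against GT
    kd = l1(student(lr), teacher(lr))   # conventional output-level KD
    # Label consistency: the student sees F(x) and its output is mapped
    # back with F^{-1} before comparison with the teacher's output on x.
    dukd = l1(hflip(student(hflip(lr_up))), teacher(lr_up))
    return rec + lam_kd * kd + lam_dukd * dukd


# Toy usage: the "student" is the teacher plus a constant 0.5 offset.
teacher = upscale2x
student = lambda x: [[v + 0.5 for v in row] for row in upscale2x(x)]
lr = [[0.0, 1.0], [2.0, 3.0]]
loss = total_loss(student, teacher, lr, upscale2x(lr), [[1.0, 0.0]])
```

In this toy setup each term equals the constant offset, so the weights $\lambda_{kd}$ and $\lambda_{dukd}$ directly control how much the two distillation terms contribute relative to reconstruction.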

Abstract

Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from cumbersome pre-trained teacher models to compact student models. However, current KD methods for super-resolution (SR) networks overlook a basic property of the SR task: the teacher model's outputs are noisy approximations to the ground-truth distribution of high-quality images (GT), which obscures the teacher's knowledge and limits the effect of KD. To utilize the teacher model beyond the GT upper bound, we present Data Upcycling Knowledge Distillation (DUKD), which transfers the teacher model's knowledge to the student model through upcycled in-domain data derived from the training data. In addition, we impose label consistency regularization on KD for SR through paired invertible augmentations to improve the student model's performance and robustness. Comprehensive experiments demonstrate that DUKD significantly outperforms prior methods on several SR tasks.

Paper Structure (15 sections, 6 equations, 6 figures, 10 tables)

Introduction
Related Works
Image Super-Resolution
Knowledge Distillation
Methodology
Notations and Preliminaries
Motivation
Data Upcycling
Label Consistency Regularization
Difference with Data Augmentation
Experiments
Experiment Setups
Results and Comparison
Ablation Analysis
Conclusion

Figures (6)

Figure 1: Framework of the DUKD method. It facilitates the student with the prior knowledge provided by the teacher through upcycled in-domain data. The label consistency regularization enhances the generalizability of the student. Besides $\mathcal{L}_{dukd}$, the total loss also includes the conventional $\mathcal{L}_{rec}$ and $\mathcal{L}_{kd}$, which are omitted for simplicity.
Figure 2: Similarity between the student and teacher ×4 EDSR models over different training approaches (indicated by the x-axis labels). PSNR(S,T) denotes the average PSNR between student and teacher models' outputs, with larger values reflecting higher similarity. PSNR(S,GT) denotes the average PSNR between the output of the student model and ground-truth HR image, with larger values showing better fitting (left: training set) or generalization performance (right: testing set).
Figure 3: Comparison of logits-KD, data-free KD, and DUKD.
Figure 4: Comparison of the label consistency regularization in high-level CV and KD for SR. The augmentations should be invertible to make the models' output comparable.
Figure 5: ×4 SR examples from EDSR models on img004, img019, img089 and img096 from Urban100. The PSNR (dB) of each cropped region is annotated below the image.
...and 1 more figure
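The captions of Figures 2 and 5 report PSNR in dB, both between student and teacher outputs, PSNR(S,T), and between student output and ground truth, PSNR(S,GT). As a reminder of what that number measures, here is a minimal sketch of the standard definition $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The function name `psnr`, the nested-list image format, and the `max_val` default are assumptions for illustration; SR evaluation protocols often add steps such as border cropping or luminance-only comparison, which are not shown here.

```python
import math
from typing import List

Image = List[List[float]]  # toy grayscale image as nested lists (assumption)


def psnr(a: Image, b: Image, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    n = len(a) * len(a[0])
    mse = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# The same function covers both quantities in the Figure 2 caption:
# psnr(student_out, teacher_out) for PSNR(S,T),
# psnr(student_out, ground_truth) for PSNR(S,GT).
```

Higher values mean the two images are closer, which is why the caption reads a larger PSNR(S,T) as higher student-teacher similarity.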