Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-Resolution

Simiao Li; Yun Zhang; Wei Li; Hanting Chen; Wenjia Wang; Bingyi Jing; Shaohui Lin; Jie Hu

TL;DR

This work addresses the challenge of compressing image super-resolution (SR) models through knowledge distillation (KD) without being tied to a single teacher-student architecture. It introduces MiPKD, a two-granularity KD framework comprising a Feature Prior Mixer, which fuses teacher and student priors in a shared latent space, and a Block Prior Mixer, which mixes them dynamically at the block level. The training loss combines logits, feature, and block distillation terms. Empirical results across CNN and Transformer backbones show that MiPKD yields consistent PSNR/SSIM gains over strong baselines, including in compounded depth and width compression scenarios. Ablations highlight the value of separate encoders, the 3D random masking strategy, and the auto-encoder auxiliary loss. The method offers a flexible, architecture-agnostic approach to distilling high-quality SR models suitable for deployment on resource-constrained devices.

Abstract

Knowledge distillation (KD) is a promising yet challenging model compression technique that transfers rich learned representations from a well-performing but cumbersome teacher model to a compact student model. Previous methods for image super-resolution (SR) mostly compare the feature maps directly or after standardizing the dimensions with basic algebraic operations (e.g., average, dot-product). However, these methods overlook the intrinsic semantic differences between the feature maps, which are caused by the disparate expressive capacity of the two networks. This work presents MiPKD, a multi-granularity mixture of priors KD framework, which yields efficient SR models through feature mixture in a unified latent space and stochastic network block mixture. Extensive experiments demonstrate the effectiveness of the proposed MiPKD method.
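
As a rough illustration of what feature mixture in a unified latent space could look like, the sketch below mixes two same-shaped latent feature maps element-wise with a random binary 3D mask. This page does not give the paper's equations, so everything here is an assumption made for illustration: the function name `mix_priors`, the `student_ratio` parameter, the (C, H, W) layout, and the choice of an independent Bernoulli mask are all hypothetical and may differ from the authors' Feature Prior Mixer.

```python
import numpy as np


def mix_priors(student_latent, teacher_latent, student_ratio=0.5, rng=None):
    """Mix two (C, H, W) latent maps using a random binary 3D mask.

    Hypothetical sketch: each element is taken from the student with
    probability `student_ratio`, otherwise from the teacher.
    """
    if student_latent.shape != teacher_latent.shape:
        # Mixing element-wise only makes sense once both maps live in
        # the same latent space, i.e. have identical shapes.
        raise ValueError("latent maps must have the same shape")
    rng = np.random.default_rng() if rng is None else rng
    # True where the student value is kept, False where the teacher's is used.
    mask = rng.random(student_latent.shape) < student_ratio
    mixed = np.where(mask, student_latent, teacher_latent)
    return mixed, mask


# Toy usage: zeros stand in for student latents, ones for teacher latents,
# so the mixed map directly shows which source each element came from.
rng = np.random.default_rng(0)
student = np.zeros((4, 8, 8))
teacher = np.ones((4, 8, 8))
mixed, mask = mix_priors(student, teacher, student_ratio=0.5, rng=rng)
```

In this toy setup the masking acts over all three axes (channel, height, width) at once, which is one plausible reading of "3D random masking"; a real implementation would operate on encoder outputs inside a training loop.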

Paper Structure (10 sections, 9 equations, 3 figures, 11 tables)

Introduction
Related Work
Methodology
Preliminaries and Notations
Mixture of Prior Knowledge Distillation
Experimental Results
Experiment Setups
Results and Comparison
Ablation Study
Conclusion

Figures (3)

Figure 1: The PSNR of student models on the Urban100 test set under different compression settings. For depth compression (a), hardly any KD method outperforms vanilla logits-KD. For width compression (b), CSD performs well but applies only to this setting. For compounded compression, almost all KD methods underperform training without KD.
Figure 2: Framework of the MiPKD method. MiPKD utilizes the multi-granularity prior mixture to constrain the KD process. The feature prior mixer dynamically combines priors from the teacher and student, and the block prior mixer adopts a coarser-grained prior mixture at the network block level.
Figure 3: Visual comparison ($\times$4) with existing SRKD methods on Urban100. The numbers in brackets denote the PSNR of the presented patches.
