Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-Resolution

Simiao Li, Yun Zhang, Wei Li, Hanting Chen, Wenjia Wang, Bingyi Jing, Shaohui Lin, Jie Hu

TL;DR

This work addresses the challenge of compressing image super-resolution (SR) models through knowledge distillation (KD) without being tied to a single teacher-student architecture. It introduces MiPKD, a two-granularity KD framework comprising a Feature Prior Mixer and a Block Prior Mixer. The former fuses teacher and student priors in a shared latent space; the latter mixes them dynamically at the block level. Training uses a multi-term loss that combines logits, feature, and block distillation signals. Empirical results across CNN and Transformer backbones show that MiPKD yields consistent PSNR/SSIM gains over strong baselines, including in compounded depth and width compression scenarios. Ablations highlight the value of separate encoders, the 3D random masking strategy, and the auto-encoder auxiliary loss. The method offers a flexible, architecture-agnostic approach to distilling high-quality SR models suitable for deployment on resource-constrained devices.

Abstract

Knowledge distillation (KD) is a promising yet challenging model compression technique that transfers rich learned representations from a well-performing but cumbersome teacher model to a compact student model. Previous methods for image super-resolution (SR) mostly compare the feature maps directly or after standardizing their dimensions with basic algebraic operations (e.g., average, dot-product). However, these methods overlook the intrinsic semantic differences among feature maps, which stem from the disparate expressive capacities of the two networks. This work presents MiPKD, a multi-granularity mixture-of-priors KD framework, which facilitates efficient SR models through feature mixture in a unified latent space and stochastic network block mixture. Extensive experiments demonstrate the effectiveness of the proposed MiPKD method.

Paper Structure (10 sections, 9 equations, 3 figures, 11 tables)


Figures (3)

  • Figure 1: The PSNR of student models on the Urban100 test set under different compression settings. For depth compression (a), barely any KD methods outperform vanilla logits-KD. For width compression (b), CSD performs well but applies only to this setting. For compounded compression, almost all KD methods underperform training without KD.
  • Figure 2: Framework of the MiPKD method. MiPKD utilizes the multi-granularity prior mixture to constrain the KD process. The feature prior mixer dynamically combines priors from the teacher and student, and the block prior mixer adopts a coarser-grained prior mixture at the network block level.
  • Figure 3: Visual comparison ($\times$4) with existing SRKD methods on Urban100. The numbers in brackets denote the PSNR of the presented patches.
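
The two mixing granularities named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that teacher and student features have already been encoded to the same C x H x W shape, that the 3D random mask is an independent per-element Bernoulli draw, and that teacher and student have the same number of shape-compatible blocks. The function names `mix_features` and `mix_blocks` and the parameter `teacher_prob` are hypothetical.

```python
import random


def mix_features(student, teacher, teacher_prob, rng):
    """Feature-level mixture (assumed form): a random 3D mask picks, per
    element, the teacher value with probability teacher_prob, else the
    student value. Inputs are equally shaped C x H x W nested lists."""
    mixed = []
    for s_channel, t_channel in zip(student, teacher):
        channel = []
        for s_row, t_row in zip(s_channel, t_channel):
            channel.append([
                t if rng.random() < teacher_prob else s
                for s, t in zip(s_row, t_row)
            ])
        mixed.append(channel)
    return mixed


def mix_blocks(x, student_blocks, teacher_blocks, teacher_prob, rng):
    """Block-level mixture (assumed form): at each stage, route the input
    through the teacher block with probability teacher_prob, otherwise
    through the corresponding student block."""
    for s_block, t_block in zip(student_blocks, teacher_blocks):
        block = t_block if rng.random() < teacher_prob else s_block
        x = block(x)
    return x
```

Setting `teacher_prob` to 0.0 recovers the pure student path and 1.0 the pure teacher path; intermediate values give a stochastic mixture of the two at either granularity.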