Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression

Yongheng Zhang, Danfeng Yan

TL;DR

Edge devices require efficient image restoration, but transformer-based models are computationally heavy. The paper introduces Soft Knowledge Distillation (SKD) with Multi-dimensional Cross-net Attention (MCA), which lets a lightweight student absorb teacher attention across channel and spatial dimensions, and pairs it with a Gaussian kernel (GK) distance in kernel space and an image-level contrastive loss. MCA produces two attended versions of the student feature, $S_{fc}^i = \mathrm{softmax}\left(T_f^i (S_f^i)^\top / \lambda\right) S_f^i$ and $S_{ft}^i = S_f^i \, \mathrm{softmax}\left((T_f^i)^\top S_f^i / \lambda\right)$. The GK-based loss is $L_{GK} = GK(S_f^i, T_f^i) + \alpha_1 \left( GK(S_{fc}^i, T_f^i) + GK(S_{ft}^i, T_f^i) \right)$, where $GK(x, y) = 1 - \exp\left(-\lVert x - y \rVert^2 / (2\sigma^2)\right)$. The overall objective combines reconstruction, kernel alignment, and contrastive terms: $L = L_{REC} + \alpha_2 L_{GK} + \alpha_3 L_{CL}$, with $L_{REC} = \lVert G - S_r \rVert_1$. The reported results show substantial compute reductions with preserved restoration quality across deraining, deblurring, and denoising, which supports practical deployment on edge devices.
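The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes that each feature map is flattened to shape `(C, N)` (channels by spatial positions), that student and teacher features share that shape, that the softmax runs over the last axis, that the squared norm in $GK$ is taken over the whole feature map, and that $L_{CL}$ is supplied as a precomputed scalar because its form is not given here. The function names (`mca`, `gk`, `gk_loss`, `total_loss`) and the default hyperparameter values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along one axis (axis choice is an assumption).
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mca(s_f, t_f, lam=1.0):
    """Cross-net attention between student s_f and teacher t_f, both (C, N)."""
    # S_fc = softmax(T_f S_f^T / lambda) S_f : (C, C) attention applied on the left
    s_fc = softmax(t_f @ s_f.T / lam) @ s_f
    # S_ft = S_f softmax(T_f^T S_f / lambda) : (N, N) attention applied on the right
    s_ft = s_f @ softmax(t_f.T @ s_f / lam)
    return s_fc, s_ft

def gk(x, y, sigma=1.0):
    # GK(x, y) = 1 - exp(-||x - y||^2 / (2 sigma^2)); lies in [0, 1)
    d2 = np.sum((x - y) ** 2)
    return 1.0 - np.exp(-d2 / (2.0 * sigma ** 2))

def gk_loss(s_f, t_f, alpha1=0.5, lam=1.0, sigma=1.0):
    # L_GK = GK(S_f, T_f) + alpha1 * (GK(S_fc, T_f) + GK(S_ft, T_f))
    s_fc, s_ft = mca(s_f, t_f, lam)
    return gk(s_f, t_f, sigma) + alpha1 * (gk(s_fc, t_f, sigma) + gk(s_ft, t_f, sigma))

def total_loss(g, s_r, l_gk, l_cl, alpha2=1.0, alpha3=1.0):
    # L = L_REC + alpha2 * L_GK + alpha3 * L_CL, with L_REC = ||G - S_r||_1
    l_rec = np.abs(g - s_r).sum()
    return l_rec + alpha2 * l_gk + alpha3 * l_cl
```

With the assumed `(C, N)` layout, the first product attends over channels and the second over spatial positions; with the transposed layout the roles would swap. The `(N, N)` attention matrix grows quadratically with resolution, so a practical version would likely operate on patches or downsampled features.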

Abstract

Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
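One plausible reading of the stability claim for the Gaussian kernel distance, offered as an interpretation rather than the authors' stated argument, is that both the loss and its gradient are bounded. Using the definition $GK(x, y) = 1 - \exp\left(-\lVert x - y \rVert^2 / (2\sigma^2)\right)$ from the TL;DR and writing $d = \lVert x - y \rVert$ (the symbol $d$ is introduced here for illustration):

$$
\begin{aligned}
GK(x, y) &= 1 - \exp\left(-\frac{d^2}{2\sigma^2}\right) \in [0, 1), \\
\nabla_x GK(x, y) &= \frac{x - y}{\sigma^2} \exp\left(-\frac{d^2}{2\sigma^2}\right), \\
\lVert \nabla_x GK(x, y) \rVert &= \frac{d}{\sigma^2} \exp\left(-\frac{d^2}{2\sigma^2}\right) \le \frac{1}{\sigma \sqrt{e}}, \quad \text{with equality at } d = \sigma.
\end{aligned}
$$

By contrast, the gradient of a squared L2 distance, $2(x - y)$, grows linearly with $d$, so a few badly mismatched features can dominate an update. The trade-off is that the GK gradient also decays toward zero when $d \gg \sigma$, which makes the choice of $\sigma$ matter.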

Paper Structure (11 sections, 6 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: PSNR$\uparrow$ vs. FLOPs$\downarrow$ denoising results on SIDD.
  • Figure 2: The overall architecture of the proposed Soft Knowledge Distillation (SKD) for image restoration models compression.
  • Figure 3: Qualitative results of knowledge distillation methods.
  • Figure 4: Qualitative comparison with light-weight methods.
  • Figure 5: Deblurring results on real-world dataset BLUR-Jrim2020real.
  • ...and 1 more figures