Attention-guided Feature Distillation for Semantic Segmentation

Amir M. Mansourian, Arya Jalali, Rozhan Ahmadi, Shohreh Kasaei

TL;DR

This work targets efficient semantic segmentation by narrowing the accuracy gap between large and small models. It introduces Attention-guided Feature Distillation (AttnFD), which refines intermediate features with the Convolutional Block Attention Module (CBAM) and transfers knowledge using a simple mean-squared error between the normalized, refined teacher and student features, paired with the standard cross-entropy loss. Across Pascal VOC, Cityscapes, COCO, and CamVid, AttnFD consistently outperforms existing distillation approaches, showing that attention-based feature refinement can capture essential context without complex loss designs. The approach is architecture-agnostic, scales across multiple backbones, and offers a practical balance between accuracy and efficiency for dense prediction tasks.

Abstract

Deep learning models have achieved significant results across various computer vision tasks. However, due to the large number of parameters in these models, deploying them in real-time scenarios is a critical challenge, especially in dense prediction tasks such as semantic segmentation. Knowledge distillation has emerged as a successful technique for addressing this problem by transferring knowledge from a cumbersome model (teacher) to a lighter model (student). In contrast to the complex methodologies commonly employed for distilling knowledge from a teacher to a student, this paper shows the efficacy of a simple yet powerful method that uses refined feature maps to transfer attention. The proposed method is effective in distilling rich information and outperforms existing methods in semantic segmentation as a dense prediction task. The proposed Attention-guided Feature Distillation (AttnFD) method employs the Convolutional Block Attention Module (CBAM), which refines feature maps by taking into account both channel-specific and spatial information content. Using only the Mean Squared Error (MSE) loss between the refined feature maps of the teacher and the student, AttnFD achieves state-of-the-art results in terms of improving the mean Intersection over Union (mIoU) of the student network on the Pascal VOC 2012, Cityscapes, COCO, and CamVid datasets.
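
The training objective described above (cross-entropy plus an MSE term between normalized, refined teacher and student features) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation. The per-sample L2 normalization, the weighting factor `alpha`, and all function names are assumptions introduced here, and the refined feature maps are taken as given inputs of matching shape.

```python
import numpy as np

def l2_normalize(feat, eps=1e-12):
    """Scale each sample's feature map to unit L2 norm.
    Assumption: the page only says the features are 'normalized'."""
    flat = feat.reshape(feat.shape[0], -1)
    norm = np.linalg.norm(flat, axis=1, keepdims=True)
    return (flat / (norm + eps)).reshape(feat.shape)

def attn_distill_loss(student_refined, teacher_refined):
    """MSE between normalized refined features, both shaped (N, C, H, W)."""
    s = l2_normalize(student_refined)
    t = l2_normalize(teacher_refined)
    return float(np.mean((s - t) ** 2))

def cross_entropy(logits, labels):
    """Pixel-wise cross-entropy. logits: (N, K, H, W); labels: (N, H, W) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[:, None, :, :], axis=1)
    return float(-picked.mean())

def total_loss(logits, labels, student_refined, teacher_refined, alpha=1.0):
    """L_ce + alpha * L_attn; alpha is a hypothetical weighting factor."""
    return cross_entropy(logits, labels) + alpha * attn_distill_loss(
        student_refined, teacher_refined
    )
```

Because both feature maps are normalized before the comparison, this distillation term ignores differences in overall feature magnitude between teacher and student and penalizes only differences in the pattern of activations.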

Paper Structure (24 sections, 6 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Visualization of images (a), raw feature maps (b), and refined feature maps (c). Channel and spatial attention are applied to the raw features, emphasizing the important regions and making them a valuable distillation source.
  • Figure 2: Overall diagram of the proposed distillation method. The student model is trained with cross-entropy loss ($L_{ce}$) along with distillation loss between the refined feature maps of the teacher and the student ($L_{attn}$). Refined feature maps are obtained by applying channel attention, followed by spatial attention on the feature maps. The teacher's parameters are frozen during the training of the student, and any inconsistency between the sizes of the features is compensated using interpolation and convolution operations.
  • Figure 3: Channel Attention Module. It applies average-pooling and max-pooling operators over the spatial dimensions, producing one descriptor per channel for each operator. The resulting outputs are then passed through a shared Multi-Layer Perceptron and fed into a sigmoid activation function to generate the channel attention map $M_C$.
  • Figure 4: Spatial Attention Module. Max-pooling and average-pooling operators are utilized to generate feature descriptors. These descriptors are subsequently fed into a convolution layer and a sigmoid activation function, resulting in spatial attention map $M_S$.
  • Figure 5: Some qualitative comparisons on the Pascal VOC validation split.
  • ...and 5 more figures
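
The two attention modules in Figures 3 and 4 can be sketched as follows. This is a rough NumPy illustration under stated assumptions, not the paper's code. Following the original CBAM design, the channel branch pools over spatial positions, the two MLP outputs are summed before the sigmoid, the MLP uses a ReLU bottleneck, and the spatial branch pools along the channel axis before its convolution. The weight arrays `w1`, `w2`, and `kernel` are hypothetical placeholders that would be learned in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Channel attention map M_C for feat of shape (C, H, W).
    w1: (C // r, C) and w2: (C, C // r) are the shared MLP weights."""
    avg = feat.mean(axis=(1, 2))  # average-pooled descriptor, shape (C,)
    mx = feat.max(axis=(1, 2))    # max-pooled descriptor, shape (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # bottleneck MLP with ReLU

    return sigmoid(mlp(avg) + mlp(mx))  # shape (C,)

def spatial_attention(feat, kernel):
    """Spatial attention map M_S for feat of shape (C, H, W).
    kernel: (2, k, k) convolution weights with odd k."""
    desc = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(1, 2)
    )  # (2, H, W, k, k)
    conv = np.einsum("chwij,cij->hw", windows, kernel)  # 'same' convolution
    return sigmoid(conv)  # shape (H, W)

def cbam_refine(feat, w1, w2, kernel):
    """Apply channel attention, then spatial attention, as in Figure 2."""
    feat = feat * channel_attention(feat, w1, w2)[:, None, None]
    return feat * spatial_attention(feat, kernel)[None, :, :]
```

The refined map keeps the shape of its input, so the output of `cbam_refine` for a teacher layer and the corresponding student layer can be compared directly once any size mismatch has been resolved by the interpolation and convolution steps mentioned in the Figure 2 caption.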