ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

Qizhen Lan, Qing Tian

TL;DR

ACAM-KD addresses the limitations of static, teacher-driven feature masking in knowledge distillation for dense prediction tasks. It introduces two key components: STCA-FF, a cross-attention fusion mechanism in which teacher features serve as queries and student features as keys and values, and ASCM, which generates adaptive spatial and channel masks from the fused features. A diversity loss encourages the masks to differ from one another. The method improves object detection on COCO2017 (by up to 1.4 mAP over the state of the art for a ResNet-50 student distilled from a ResNet-101 teacher) and semantic segmentation on Cityscapes (by 3.09 mIoU over the baseline with a DeepLabV3-MobileNetV2 student), with consistent gains across multiple backbones and detectors. The results suggest that dynamic, cooperative interaction between student and teacher improves distillation quality while keeping the student efficient in dense-prediction settings.

Abstract

Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.
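
The two components can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes feature maps flattened to N tokens of C channels, omits the learned query/key/value projections, and uses hypothetical mask heads (`w_spatial`, `w_channel`) and a hypothetical masked squared-error loss. Only the roles of teacher (queries) and student (keys/values) come from the summary above.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_attention_fuse(teacher, student):
    """Teacher tokens act as queries; student tokens act as keys and values."""
    d = len(teacher[0])
    keys_t = [list(col) for col in zip(*student)]            # C x N
    scores = matmul(teacher, keys_t)                          # N x N
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, student)                              # N x C

def adaptive_masks(fused, w_spatial, w_channel):
    """Assumed mask heads: one sigmoid score per position and one per channel."""
    n = len(fused)
    spatial = [sigmoid(sum(f * w for f, w in zip(tok, w_spatial))) for tok in fused]
    pooled = [sum(col) / n for col in zip(*fused)]            # average over positions
    channel = [sigmoid(p * w) for p, w in zip(pooled, w_channel)]
    return spatial, channel

def masked_distill_loss(teacher, student, spatial, channel):
    """Assumed loss: squared feature error weighted by both masks."""
    total = 0.0
    for t_tok, s_tok, m_s in zip(teacher, student, spatial):
        for t, s, m_c in zip(t_tok, s_tok, channel):
            total += m_s * m_c * (t - s) ** 2
    return total / (len(teacher) * len(channel))

random.seed(0)
N, C = 6, 4  # toy sizes: 6 spatial positions, 4 channels
teacher = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
student = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
w_spatial = [random.gauss(0, 1) for _ in range(C)]
w_channel = [random.gauss(0, 1) for _ in range(C)]

fused = cross_attention_fuse(teacher, student)
spatial, channel = adaptive_masks(fused, w_spatial, w_channel)
loss = masked_distill_loss(teacher, student, spatial, channel)
```

Because the masks are computed from the fused features rather than from the teacher alone, they change whenever the student's features change, which is the adaptive behavior the abstract describes.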

Paper Structure

This paper contains 23 sections, 9 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Visualization of attention maps from different models at various training stages. (a) input image, (b) teacher model's attention, and (c-e) student model's attention at epochs 4, 12, and 24. The color represents attention intensity, with red indicating the highest focus and blue the lowest. The teacher model's attention is static and suboptimal. The student's attention evolves over time, with epoch 12 demonstrating better localization than the teacher.
  • Figure 2: Overview of the proposed Adaptive and Cooperative Attention Masking for Knowledge Distillation (ACAM-KD) framework. ACAM-KD leverages student-teacher cross-attention to dynamically and cooperatively guide the distillation process. Instead of relying on static, teacher-determined attention masking, our distillation masks are updated throughout distillation, adapting to the student's evolving learning needs and selectively focusing on the most beneficial spatial and channel-wise features at a particular learning stage. Different colors in the adaptive channel masks indicate varying levels of importance.
  • Figure 3: Visualization of spatial attention masks associated with different learnable selection units. Variations in highlighted regions, encouraged by our diversity loss, ensure complementary feature learning for effective knowledge distillation. Warmer colors (red/yellow) indicate higher attention, while cooler colors (blue) denote lower attention.
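
The Figure 3 caption says a diversity loss pushes the masks of different selection units toward different regions. The exact form of that loss is not stated in this summary, so the sketch below assumes one plausible choice: the mean pairwise Dice overlap between flattened masks, which is lower when masks overlap less. The function names are hypothetical.

```python
from itertools import combinations

def dice_overlap(a, b, eps=1e-8):
    """Dice coefficient between two flattened masks with values in [0, 1]."""
    inter = sum(x * y for x, y in zip(a, b))
    return 2.0 * inter / (sum(x * x for x in a) + sum(y * y for y in b) + eps)

def diversity_penalty(masks):
    """Mean pairwise overlap; minimizing it pushes masks toward different regions."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        return 0.0
    return sum(dice_overlap(a, b) for a, b in pairs) / len(pairs)

# Two masks on disjoint positions incur no penalty; identical masks incur about 1.
disjoint = diversity_penalty([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
identical = diversity_penalty([[0.5, 0.2, 0.9, 0.1], [0.5, 0.2, 0.9, 0.1]])
```

Added to the distillation objective with a small weight, a term of this kind would discourage all selection units from collapsing onto the same salient region.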