LAKD-Activation Mapping Distillation Based on Local Learning

Yaoze Zhang; Yuming Zhang; Yu Zhao; Yue Zhang; Feiyu Zhu

TL;DR

Knowledge distillation (KD) often suffers from tightly coupled learning objectives that hinder interpretability and the efficient use of teacher information. LAKD introduces a Separation-Decoupling Mechanism (SDM) and Non-Directional Activation Mapping (NDAM) to decouple layer-wise knowledge transfer and guide attention across independent modules. The approach yields improved interpretability, reduced memory usage, and state-of-the-art or competitive performance on CIFAR-100 and ImageNet across multiple architectures. The combination enables more robust, locality-aware distillation that makes better use of teacher knowledge while mitigating overfitting to teacher features. Overall, LAKD advances KD by combining local learning with attention-guided decoupling to improve both learning efficiency and generalization.

Abstract

Knowledge distillation is widely applied in various fundamental vision models to enhance the performance of compact models. Existing knowledge distillation methods focus on designing different distillation targets to acquire knowledge from teacher models. However, these methods often overlook the efficient utilization of distilled information: they crudely couple different types of information, which makes it difficult to explain how the knowledge from the teacher network helps the student network learn. This paper proposes a novel knowledge distillation framework, Local Attention Knowledge Distillation (LAKD), which utilizes the distilled information from teacher networks more efficiently, achieving higher interpretability and competitive performance. The framework establishes an independent interactive training mechanism through a separation-decoupling mechanism and non-directional activation mapping. LAKD decouples the teacher's features and facilitates progressive interaction training from simple to complex. Specifically, the student network is divided into local modules with independent gradients to decouple the knowledge transferred from the teacher. The non-directional activation mapping helps the student network integrate knowledge from different local modules by learning coarse-grained feature knowledge. We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and the results show that our LAKD method significantly outperforms existing methods, consistently achieving state-of-the-art performance across different datasets.
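
The idea of local modules with independent gradients can be illustrated with a toy scalar example. This is only a sketch under stated assumptions, not the LAKD implementation: the two-stage "student", the squared-error local losses, and the names `local_grads`, `t1`, and `t2` are all hypothetical. It shows that once the output of stage 1 is treated as a constant when it enters stage 2 (what an autograd framework typically achieves with a call such as PyTorch's `Tensor.detach()`), the stage-2 alignment loss no longer influences the stage-1 weight.

```python
# Toy two-stage "student": h1 = w1 * x, h2 = w2 * h1 (all scalars).
# Each stage has its own local loss against a hypothetical teacher target
# (t1 for stage 1, t2 for stage 2). Gradients are derived by hand.

def local_grads(x, w1, w2, t1, t2, detach=True):
    """Return (dLoss/dw1, dLoss/dw2) for Loss = (h1 - t1)^2 + (h2 - t2)^2.

    With detach=True, h1 is treated as a constant when it enters stage 2,
    so the stage-2 loss contributes nothing to the gradient of w1.
    """
    h1 = w1 * x
    h2 = w2 * h1

    g1_local = 2.0 * (h1 - t1) * x             # d(h1 - t1)^2 / dw1
    g2_local = 2.0 * (h2 - t2) * h1            # d(h2 - t2)^2 / dw2
    g1_from_stage2 = 2.0 * (h2 - t2) * w2 * x  # d(h2 - t2)^2 / dw1 (chain rule)

    if detach:
        # Assumed gradient isolation: cut the path from stage 2 back to w1.
        g1_from_stage2 = 0.0

    return g1_local + g1_from_stage2, g2_local


coupled = local_grads(1.0, 2.0, 3.0, 1.0, 5.0, detach=False)
decoupled = local_grads(1.0, 2.0, 3.0, 1.0, 5.0, detach=True)
print("coupled:", coupled)
print("decoupled:", decoupled)
```

In the decoupled case the gradient for `w1` depends only on the stage-1 target, so each stage solves its own alignment task; the gradient for `w2` is the same in both cases.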

Paper Structure (21 sections, 12 equations, 7 figures, 5 tables)

Introduction
Related Work
Knowledge Distillation
Local Learning
Method
Preliminaries
Local Attention Knowledge Distillation
Separation-Decoupling Mechanism
Non-Directional Activation Mapping
Experiments
Implementation Details
Contrastive Results
Results on CIFAR-100
Comparison of Different Detachment Module Settings
Comparison on Other Datasets
...and 6 more sections

Figures (7)

Figure 1: Comparison of teacher-student layer Centered Kernel Alignment (CKA) similarity [kornblith2019similarity] across different distillation methods. The teacher model is ResNet56 and the student model is ResNet20. The distillation methods compared are LAKD (ours), CAT-KD [catkd], FitNet [hint], and KD [hint]. FitNet aligns each layer individually.
Figure 2: Overview of the proposed LAKD framework. Our method combines the Separation-Decoupling Mechanism (SDM) and Non-Directional Activation Mapping (NDAM). Building on CAT-KD, we add feature distillation for earlier layers. SDM applies gradient detachment to isolate the alignment tasks for each layer, while NDAM uses weights to integrate prior information and guide each module to focus on critical regions identified by the teacher.
Figure 3: Illustration of the Non-Directional Activation Mapping (NDAM) Module. The module integrates feature maps using both average pooling (AVG) and maximum pooling (MAX) operations. The outputs are then combined to generate a refined activation map that highlights key regions of the input feature map.
Figure 4: Top: Feature map of CAT-KD. Bottom: Feature map of LAKD.
Figure 5: Per-layer L2 loss between student and teacher under different Detachment Module settings.
...and 2 more figures
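
The Figure 3 caption describes an activation map built from average-pooled and max-pooled features. The following is a minimal sketch of that general pattern, not the paper's NDAM module. Several details are assumptions that the caption does not specify: pooling is taken across channels at each spatial position, the two pooled maps are combined by simple addition, and the result is normalized with a softmax over positions. The function name `activation_map` is hypothetical.

```python
import math

def activation_map(features):
    """Build a spatial activation map from a C x H x W feature map.

    features: list of C channels, each an H x W nested list of floats.
    Returns an H x W nested list whose entries sum to 1.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])

    scores = []
    for i in range(height):
        for j in range(width):
            values = [features[c][i][j] for c in range(channels)]
            avg_pooled = sum(values) / channels  # AVG branch (assumed channel-wise)
            max_pooled = max(values)             # MAX branch (assumed channel-wise)
            scores.append(avg_pooled + max_pooled)  # assumed combination rule

    # Numerically stable softmax over all spatial positions (assumed normalization).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    flat = [e / total for e in exps]

    # Reshape the flat list back to H x W.
    return [flat[i * width:(i + 1) * width] for i in range(height)]


# Two channels on a 2 x 2 grid; only position (0, 1) is strongly activated.
demo = [
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 3.0], [0.0, 0.0]],
]
amap = activation_map(demo)
print(amap)
```

The resulting map concentrates most of its mass on the position where both pooled statistics are large, which is the sense in which such a map highlights key regions of the input feature map.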
