Frequency Attention for Knowledge Distillation

Cuong Pham; Van-Anh Nguyen; Trung Le; Dinh Phung; Gustavo Carneiro; Thanh-Toan Do

TL;DR

A novel module that acts as an attention mechanism in the frequency domain: it adjusts the frequencies of the student's features under the guidance of the teacher's features, which encourages the student's features to have patterns similar to the teacher's.

Abstract

Knowledge distillation is an attractive approach for learning compact deep neural networks: it trains a lightweight student model by distilling knowledge from a complex teacher model. Attention-based knowledge distillation is a specific form of intermediate feature-based knowledge distillation that uses attention mechanisms to encourage the student to better mimic the teacher. However, most previous attention-based distillation approaches perform attention in the spatial domain, which primarily affects local regions in the input image. This may not be sufficient when we need to capture the broader context or global information necessary for effective knowledge transfer. In the frequency domain, each frequency is determined from all pixels of the image in the spatial domain, so it can contain global information about the image. Inspired by the benefits of the frequency domain, we propose a novel module that functions as an attention mechanism in the frequency domain. The module consists of a learnable global filter that can adjust the frequencies of the student's features under the guidance of the teacher's features, which encourages the student's features to have patterns similar to the teacher's features. We then propose an enhanced knowledge review-based distillation model by leveraging the proposed frequency attention module. Extensive experiments with various teacher and student architectures on image classification and object detection benchmark datasets show that the proposed approach outperforms other knowledge distillation methods.
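
The claim that each frequency carries global information can be read directly off the standard 2D discrete Fourier transform. The notation below is an assumption for illustration (a single-channel feature map $x$ of size $H \times W$ with spectrum $X$) and is not taken from the paper's own equations:

$$X[u, v] = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x[h, w] \, e^{-j 2\pi \left( \frac{u h}{H} + \frac{v w}{W} \right)}$$

Every coefficient $X[u, v]$ is a weighted sum over all spatial positions $(h, w)$, so rescaling a single frequency changes the feature map everywhere at once, whereas a spatial attention weight at one position only rescales that position.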

Paper Structure (22 sections, 8 equations, 4 figures, 6 tables)

Introduction
Related work
Proposed method
Frequency attention module
Global filtering.
Computational complexity of the FAM module.
Applying FAM to knowledge distillation
Layer-to-layer intermediate feature-based knowledge distillation
Knowledge review distillation
Experiments
Experimental setup
Datasets.
Implementation details.
Comparison with the state of the art
Image classification
...and 7 more sections

Figures (4)

Figure 1: Fourier Frequency Attention Module. HPF stands for high-pass filter. In the global branch, the input student's feature map is transformed to the frequency domain using the FFT. The frequencies are then adjusted by a learnable global filter. A high-pass filter is then applied to the adjusted frequency map to filter out the lowest frequencies. The local branch consists of a 1$\times$1 convolutional layer in the spatial domain. The outputs of the global and local branches are added and the resulting feature map is compared with the teacher's feature map. $\gamma_1$ and $\gamma_2$ are the learnable weighting parameters of the global and local branches, respectively.
Figure 2: The proposed enhanced layer-to-layer knowledge distillation. LA is the local attention and FAM is the proposed frequency attention module. $\mathcal{D}$ is the distance function. $F^T$ and $F^S$ represent the feature maps of teacher and student, respectively.
Figure 3: The proposed enhanced knowledge review distillation. CrossAT is the cross attention and FAM is the proposed frequency attention module. $\mathcal{D}$ is the distance function. $F^T$ and $F^S$ represent the feature maps of teacher and student, respectively.
Figure 4: (a) Original image. (b)-(e) Grad-CAMs from layer 9 of a ResNet18 model trained (b) without knowledge distillation, (c) with OFD, (d) with knowledge review (ReviewKD), and (e) with FAM-KD (ours). When training with distillation, ResNet34 is used as the teacher. The figure shows that FAM-KD (e) focuses on the object better than OFD and ReviewKD.
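
The data flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration of the forward pass only, not the authors' implementation. Several details are assumptions made for the sketch: the names `fam_forward`, `high_pass_mask`, and `cutoff` are hypothetical; the global filter is applied element-wise per channel; the high-pass filter is a hard square mask around the zero frequency; and the parameters are passed in as plain arrays rather than learned.

```python
import numpy as np


def high_pass_mask(h, w, cutoff=0):
    """Boolean (h, w) mask that is False on the lowest frequencies.

    Assumption: "lowest frequencies" means integer frequency indices
    with |ky| <= cutoff and |kx| <= cutoff (cutoff=0 removes only DC).
    """
    ky = np.abs(np.rint(np.fft.fftfreq(h) * h))[:, None]
    kx = np.abs(np.rint(np.fft.fftfreq(w) * w))[None, :]
    low = (ky <= cutoff) & (kx <= cutoff)
    return ~low


def fam_forward(x, global_filter, conv_weight, gamma1, gamma2, cutoff=0):
    """Sketch of a frequency attention forward pass on one feature map.

    x:             real array, shape (C, H, W), student feature map
    global_filter: complex array, shape (C, H, W), stands in for the
                   learnable global filter (element-wise, an assumption)
    conv_weight:   real array, shape (C_out, C), weights of a 1x1 conv
    gamma1/gamma2: scalar weights of the global and local branches
    """
    _, h, w = x.shape

    # Global branch: FFT -> global filter -> high-pass filter -> inverse FFT.
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    spectrum = spectrum * global_filter
    spectrum = spectrum * high_pass_mask(h, w, cutoff)
    global_out = np.fft.ifft2(spectrum, axes=(-2, -1)).real

    # Local branch: a 1x1 convolution is a per-pixel channel mixing.
    local_out = np.einsum("oc,chw->ohw", conv_weight, x)

    # Weighted sum of the two branches (requires C_out == C here).
    return gamma1 * global_out + gamma2 * local_out
```

In a distillation setup, the returned map would be compared with the teacher's feature map through the distance function $\mathcal{D}$, and the filter, convolution weights, $\gamma_1$, and $\gamma_2$ would be trained by backpropagation in a deep learning framework; the sketch leaves all of that out.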
