Like What You Like: Knowledge Distill via Neuron Selectivity Transfer

Zehao Huang, Naiyan Wang

TL;DR

The paper tackles the efficiency gap in deep networks by introducing Neuron Selectivity Transfer (NST), which distills knowledge by matching distributions of neuron activations rather than directly copying feature maps. NST uses Maximum Mean Discrepancy (MMD) to align teacher and student activation patterns, with kernel choices (linear, polynomial, Gaussian) capturing different aspects of neuron selectivity. Empirical results on CIFAR, ImageNet, and PASCAL VOC demonstrate that NST improves student performance and complements existing KT methods like KD, FitNet, and AT, with transferability to object detection. The work also outlines extensions beyond MMD, including GAN-based distribution matching, and positions NST as a versatile framework for broader knowledge transfer tasks.

Abstract

Although deep neural networks have demonstrated extraordinary power in various applications, their superior performance comes at the expense of high storage and computational costs. Consequently, the acceleration and compression of neural networks have attracted much attention recently. Knowledge Transfer (KT), which aims to train a smaller student network by transferring knowledge from a larger teacher model, is one of the popular solutions. In this paper, we propose a novel knowledge transfer method by treating it as a distribution matching problem. Specifically, we match the distributions of neuron selectivity patterns between teacher and student networks. To achieve this goal, we devise a new KT loss function that minimizes the Maximum Mean Discrepancy (MMD) between these distributions. Combined with the original loss function, our method can significantly improve the performance of student networks. We validate the effectiveness of our method across several datasets, and further combine it with other KT methods to explore the best possible results. Finally, we fine-tune the model on other tasks such as object detection. The results are also encouraging and confirm the transferability of the learned features.
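
To make the distribution-matching idea concrete, the following is a minimal NumPy sketch of a squared-MMD loss between teacher and student activation maps. It is an illustration under stated assumptions, not the authors' implementation: the function names (`normalize_rows`, `kernel`, `mmd2`), the use of the biased MMD estimator, the l2 normalization of each flattened activation map, and the default kernel parameters are all choices made here for the example. Each channel's activation map is treated as one sample, so the teacher and student may have different channel counts as long as the spatial size matches.

```python
import numpy as np

def normalize_rows(f, eps=1e-12):
    # Assumption: l2-normalize each flattened activation map so that
    # samples from teacher and student live on a comparable scale.
    return f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)

def kernel(x, y, kind="poly", degree=2, offset=0.0, sigma2=1.0):
    # Pairwise kernel matrix between rows of x (n, d) and rows of y (m, d).
    gram = x @ y.T
    if kind == "linear":
        return gram
    if kind == "poly":
        return (gram + offset) ** degree
    if kind == "gaussian":
        sq_dist = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2.0 * gram
        return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma2))
    raise ValueError(f"unknown kernel: {kind}")

def mmd2(teacher_maps, student_maps, **kernel_args):
    # teacher_maps: (C_T, H, W), student_maps: (C_S, H, W).
    # Biased estimator of squared MMD:
    #   mean k(t, t') + mean k(s, s') - 2 * mean k(t, s)
    t = normalize_rows(teacher_maps.reshape(teacher_maps.shape[0], -1))
    s = normalize_rows(student_maps.reshape(student_maps.shape[0], -1))
    return (kernel(t, t, **kernel_args).mean()
            + kernel(s, s, **kernel_args).mean()
            - 2.0 * kernel(t, s, **kernel_args).mean())

# Hypothetical usage: 8 teacher channels and 6 student channels on a 4x4 grid.
rng = np.random.default_rng(0)
teacher = rng.random((8, 4, 4))
student = rng.random((6, 4, 4))
loss_poly = mmd2(teacher, student, kind="poly")
loss_gauss = mmd2(teacher, student, kind="gaussian", sigma2=0.5)
```

In training, a term like this would be added to the student's original task loss with some weighting factor (a hyperparameter assumed here, not specified in the abstract), and it would be written in an autodiff framework rather than plain NumPy so that gradients reach the student.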

Paper Structure

This paper contains 21 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The architecture for our Neuron Selectivity Transfer: the student network is not only trained from ground-truth labels, but also mimics the distribution of the activations from intermediate layers in the teacher network. Each dot or triangle in the figure denotes its corresponding activation map of a filter.
  • Figure 2: Neuron activation heat map of two selected images.
  • Figure 3: Different knowledge transfer methods on CIFAR10 and CIFAR100. Test errors are shown as bold lines and train errors as dashed lines. Our NST noticeably improves final accuracy and converges quickly. Best viewed in color.
  • Figure 4: Top-1 validation error of different knowledge transfer methods on ImageNet. Best viewed in color.
  • Figure 5: t-SNE (van der Maaten and Hinton, 2008) visualization shows that our NST reduces the distance between the teacher and student activation distributions.