A Closer Look at Knowledge Distillation in Spiking Neural Network Training

Xu Liu, Na Xia, Jinxing Zhou, Jingyuan Xu, Dan Guo

TL;DR

This work tackles the challenge of training energy-efficient Spiking Neural Networks (SNNs) via knowledge distillation from pretrained ANNs. It introduces two KD strategies—Saliency-scaled Activation Map Distillation (SAMD) and Noise-smoothed Logits Distillation (NLD)—to bridge semantic and distribution gaps between continuous ANN features/logits and discrete, sparse SNN representations. SAMD aligns the SNN spike activation map (SAM) with the teacher’s class activation map (CAM) using softmax-normalized saliency distributions, while NLD smooths SNN logits with Gaussian noise to resemble the teacher’s continuous logits. Across CIFAR-10/100, ImageNet-1K, and CIFAR10-DVS, CKDSNN achieves state-of-the-art accuracy with favorable energy-efficiency trade-offs, demonstrating robust cross-domain knowledge transfer for SNNs.

Abstract

Spiking Neural Networks (SNNs) have become popular due to their excellent energy efficiency, yet they remain challenging to train effectively. Recent works address this by introducing knowledge distillation (KD) techniques, with pre-trained artificial neural networks (ANNs) used as teachers and the target SNNs as students. This is commonly accomplished through a straightforward element-wise alignment of intermediate features and prediction logits from ANNs and SNNs, often neglecting the intrinsic differences between their architectures. Specifically, an ANN's outputs exhibit a continuous distribution, whereas an SNN's outputs are characterized by sparsity and discreteness. To mitigate this issue, we introduce two innovative KD strategies. First, we propose Saliency-scaled Activation Map Distillation (SAMD), which aligns the spike activation map of the student SNN with the class-aware activation map of the teacher ANN. Rather than performing KD directly on the raw features of the ANN and SNN, SAMD directs the student to learn from saliency activation maps that exhibit greater semantic and distribution consistency. Additionally, we propose Noise-smoothed Logits Distillation (NLD), which uses Gaussian noise to smooth the sparse logits of the student SNN, facilitating alignment with the continuous logits of the teacher ANN. Extensive experiments on multiple datasets demonstrate the effectiveness of our methods. Code is available at https://github.com/SinoLeu/CKDSNN.git.
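The idea behind NLD can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `noise_smoothed_kd_loss`, the fixed noise scale `sigma`, and the temperature `tau` are hypothetical, and the temperature-scaled KL divergence is the standard logits-distillation loss, assumed here for illustration. The paper's figure list mentions an adaptive noise strategy, which this sketch does not reproduce.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def noise_smoothed_kd_loss(student_logits, teacher_logits,
                           sigma=0.1, tau=4.0, rng=None):
    """KL(teacher || noisy student) on temperature-softened logits.

    Assumption: zero-mean Gaussian noise with a fixed standard deviation
    `sigma` is added to the student logits before the softmax.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = 1e-12  # avoids log(0)

    # Smooth the sparse, discrete SNN logits with Gaussian noise.
    noisy = student_logits + sigma * rng.standard_normal(student_logits.shape)

    p_teacher = softmax(teacher_logits / tau)
    p_student = softmax(noisy / tau)

    # Per-sample KL divergence, averaged over the batch.
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
                axis=-1)
    return float(kl.mean() * tau ** 2)


# Hypothetical example: one sample, four classes.
# The student logits stand in for time-averaged SNN outputs.
student = np.array([[0.0, 0.5, 0.0, 0.25]])
teacher = np.array([[-1.2, 3.1, 0.4, 1.0]])
print(noise_smoothed_kd_loss(student, teacher, rng=np.random.default_rng(0)))
```

Setting `sigma=0.0` reduces the sketch to plain logits distillation, which makes it easy to compare the two variants on the same inputs.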

Paper Structure

This paper contains 23 sections, 23 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: (a) Prior KD methods simply align the raw hidden features and output logits between the teacher ANN and the student SNN, ignoring discrepancies in their distributions. (b) We perform KD through more precise and semantically consistent saliency maps, aligning the spike activation map of the SNN with the class activation map of the ANN. In addition, we use Gaussian noise to smooth the raw logits of the SNN, reducing the discrepancy in logits distillation.
  • Figure 2: Overview of our CKDSNN. (a) The CKDSNN framework aims to improve student SNN training by distilling knowledge from a pretrained teacher ANN. CKDSNN is primarily composed of two strategies. (b) Saliency-scaled Activation Map Distillation (SAMD) uses the class activation map (CAM) from the ANN to guide the SNN to generate precise spike activations in salient regions, i.e., the spike activation map (SAM). Saliency scaling is used to scale the CAM and SAM into magnitude-unified distributions. (c) Noise-smoothed Logits Distillation (NLD) uses Gaussian noise to soften the sparse logits of the SNN so that they better match the logits of the ANN.
  • Figure 3: (a) The main challenge of applying Grad-CAM-like strategies to SNNs is the error caused by surrogate gradients. (b) Our SAM Generation directly computes the spike activation rate of the SNN. (c) Visualization of the generated CAMs and SAMs for SNN.
  • Figure 4: Ablation study. (a) Effectiveness of CKDSNN's core strategy. (b) Comparison of SAM-CAM in SAMD with previous activation map-based ANN KD methods at t=1.
  • Figure 5: Comparison of our adaptive noise strategy with random noise.
  • ...and 7 more figures
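
The SAMD idea described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a hedged illustration, not the paper's exact formulation: the function names, the choice to average spikes over time steps and channels, the temperature `tau`, and the KL objective are all assumptions. The teacher CAM is taken as a precomputed array.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def spike_activation_map(spikes):
    """Firing rate per spatial location.

    spikes: binary array of shape (T, C, H, W).
    Assumption: the rate is the mean over time steps and channels.
    """
    return spikes.mean(axis=(0, 1))


def saliency_scale(activation_map, tau=1.0):
    """Turn an (H, W) map into a softmax-normalized spatial distribution."""
    return softmax(activation_map.reshape(-1) / tau)


def samd_loss(spikes, teacher_cam, tau=1.0):
    """KL(teacher CAM distribution || student SAM distribution)."""
    eps = 1e-12  # avoids log(0)
    p_student = saliency_scale(spike_activation_map(spikes), tau)
    p_teacher = saliency_scale(teacher_cam, tau)
    return float(np.sum(p_teacher * (np.log(p_teacher + eps)
                                     - np.log(p_student + eps))))


# Hypothetical shapes: 4 time steps, 8 channels, a 7x7 feature map.
rng = np.random.default_rng(0)
spikes = rng.integers(0, 2, size=(4, 8, 7, 7)).astype(np.float64)
teacher_cam = rng.random((7, 7))  # stand-in for a precomputed teacher CAM
print(samd_loss(spikes, teacher_cam))
```

Because both maps are normalized into distributions before comparison, the loss depends on where activation is concentrated rather than on the raw magnitudes, which differ between continuous ANN features and discrete spike rates.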