Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks

Kairong Yu, Chengting Yu, Tianqing Zhang, Xiaochen Zhao, Shu Yang, Hongwei Wang, Qiang Zhang, Qi Xu

TL;DR

This work targets the performance gap between Spiking Neural Networks (SNNs) and Artificial Neural Networks (ANNs) by tailoring knowledge distillation to the temporal nature of SNNs. The authors introduce Temporal Separation Knowledge Distillation with Entropy Regularization (TSER), which distills teacher logits at each time step and applies entropy-based stabilization to avoid propagating erroneous teacher knowledge. Key contributions include the temporal separation loss, the entropy regularization term, and extensive evaluation showing state-of-the-art results on CIFAR-10/100 and competitive performance on ImageNet, while maintaining energy-efficient operation. The approach advances practical SNN deployment by better leveraging spatiotemporal dynamics without adding prohibitive computation or time steps.

Abstract

Spiking Neural Networks (SNNs), inspired by the human brain, offer significant computational efficiency through discrete spike-based information transfer. Despite their potential to reduce inference energy consumption, a performance gap persists between SNNs and Artificial Neural Networks (ANNs), primarily due to current training methods and inherent model limitations. While recent research has aimed to enhance SNN learning by employing knowledge distillation (KD) from ANN teacher networks, traditional distillation techniques often overlook the distinctive spatiotemporal properties of SNNs, thus failing to fully leverage their advantages. To overcome these challenges, we propose a novel logit distillation method characterized by temporal separation and entropy regularization. This approach improves on existing SNN distillation techniques by performing distillation learning on the logits at each time step, rather than only on the aggregated output features. Furthermore, the integration of entropy regularization stabilizes model optimization and further boosts performance. Extensive experimental results indicate that our method surpasses prior SNN distillation strategies, whether based on logit distillation, feature distillation, or a combination of both. The code will be available on GitHub.
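
To make the contrast between per-time-step and aggregated distillation concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names, the weights `alpha` and `lam`, and above all the form and sign of the entropy term (mean Shannon entropy of the student's per-step softmax, subtracted from the loss) are assumptions made for illustration; the page only states that an entropy regularization term is added and that the CE and KL losses are computed per time step and then averaged.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(logits, label, eps=1e-12):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label] + eps)

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)

def temporal_separation_loss(student_logits, teacher_logits, labels,
                             tau=2.0, alpha=1.0, lam=0.1):
    """Per-time-step CE and KD losses, averaged over T and B.

    student_logits: nested list of shape [T][B][C]
    teacher_logits: nested list of shape [B][C]
    labels:         list of B integer class labels
    """
    T, B = len(student_logits), len(labels)
    ce = kd = ent = 0.0
    for t in range(T):
        for b in range(B):
            s = student_logits[t][b]
            ce += cross_entropy(s, labels[b])
            kd += tau ** 2 * kl_div(softmax(teacher_logits[b], tau),
                                    softmax(s, tau))
            ent += entropy(softmax(s))
    n = T * B
    ce, kd, ent = ce / n, kd / n, ent / n
    # ASSUMPTION: the entropy term enters with a negative sign; the source
    # does not give its exact form or sign.
    total = ce + alpha * kd - lam * ent
    return total, ce, kd, ent

def aggregated_kd_loss(student_logits, teacher_logits, labels,
                       tau=2.0, alpha=1.0):
    """Baseline for contrast: average logits over time first, then CE + KD."""
    T, B = len(student_logits), len(labels)
    C = len(teacher_logits[0])
    ce = kd = 0.0
    for b in range(B):
        mean = [sum(student_logits[t][b][c] for t in range(T)) / T
                for c in range(C)]
        ce += cross_entropy(mean, labels[b])
        kd += tau ** 2 * kl_div(softmax(teacher_logits[b], tau),
                                softmax(mean, tau))
    ce, kd = ce / B, kd / B
    return ce + alpha * kd, ce, kd
```

Because cross-entropy is convex in the logits, the per-step average in `temporal_separation_loss` is never smaller than the cross-entropy computed on time-averaged logits, so the per-step objective constrains every time step rather than only their mean.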

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of the Classical SNN KD and Our Proposed Method. In our approach, we remove the temporal dimension fusion operation present in classical SNN KD and apply a temporal separation strategy to focus the learning process on outputs at individual time steps. The original calculations of KL Loss and CE Loss are adapted to compute the loss for each time step’s output, followed by averaging. Additionally, we incorporate an entropy regularization term to guide the learning direction away from erroneous knowledge. Here, T, B, and C represent the time steps, batch size, and number of classes, respectively.
  • Figure 2: Prediction Accuracy Distribution at Different Time Steps for Vanilla SNN KD and Our Proposed Method. The solid red line indicates the teacher model’s accuracy, while the dashed lines represent the prediction accuracies of different distillation methods after averaging outputs over time steps. The bars show the prediction accuracies of each distillation method at individual time steps.
  • Figure 3: Accuracy Distribution for Different $\lambda$ Values. Experiments conducted on the CIFAR100 dataset with a fixed time step of 2, testing various $\lambda$ values.
  • Figure 4: Temperature Coefficient Sensitivity. Our method demonstrates stability across various $\tau$ hyperparameters. This experiment is conducted on CIFAR100 with ResNet-34 as the teacher model and ResNet-18 as the student model.
  • Figure 5: t-SNE Visualization of features learned by teacher ANN and different distillation methods.
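
The two quantities compared in Figure 2, accuracy at each individual time step and accuracy after averaging outputs over time steps, could be computed along the following lines. This is an illustrative sketch only: the function name and the nested-list layout `[T][B][C]` are assumptions, and the paper's actual evaluation code is not shown on this page.

```python
def argmax(row):
    """Index of the largest value in a list (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def per_step_and_averaged_accuracy(logits, labels):
    """Return (list of per-time-step accuracies, accuracy of time-averaged logits).

    logits: nested list of shape [T][B][C]
    labels: list of B integer class labels
    """
    T, B, C = len(logits), len(labels), len(logits[0][0])
    # Accuracy when each time step's output is used on its own.
    per_step = [
        sum(argmax(logits[t][b]) == labels[b] for b in range(B)) / B
        for t in range(T)
    ]
    # Accuracy after averaging the logits over the T time steps.
    mean_logits = [
        [sum(logits[t][b][c] for t in range(T)) / T for c in range(C)]
        for b in range(B)
    ]
    averaged = sum(argmax(mean_logits[b]) == labels[b] for b in range(B)) / B
    return per_step, averaged
```

For example, with `logits = [[[2, 0], [0, 1]], [[0, 1], [0, 3]]]` and `labels = [0, 1]`, the second time step misclassifies the first sample on its own, while the time-averaged prediction is still correct for both samples.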