Self-Distillation Learning Based on Temporal-Spatial Consistency for Spiking Neural Networks

Lin Zuo, Yongqi Ding, Mengmeng Jing, Kunshan Yang, Yunqian Yu

TL;DR

This work tackles the overhead and rigidity of traditional knowledge distillation for Spiking Neural Networks (SNNs) by introducing Temporal-Spatial Self-Distillation (TSSD). TSSD decouples training and inference timesteps by using a temporally extended teacher ($T_t$) and a spatially guided intermediate classifier, while keeping inference at ultra-low latency ($T_s$) with no extra inference cost. The method optimizes a combined loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha \mathcal{L}_{\text{tsd}} + \beta \mathcal{L}_{\text{ssd}}$, enabling the student to learn from non-deterministic soft labels in both temporal and spatial dimensions. Extensive results on CIFAR10/100, ImageNet, CIFAR10-DVS, and DVS-Gesture show consistent gains over baselines and competitive or superior performance at low timesteps, underscoring the approach’s scalability, generalizability, and practical impact for energy-efficient, real-time SNN deployments.
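As a rough illustration of how the three terms in $\mathcal{L}_{\text{total}}$ might combine, the sketch below assumes cross-entropy for $\mathcal{L}_{\text{task}}$ and a KL divergence between softened outputs for both $\mathcal{L}_{\text{tsd}}$ and $\mathcal{L}_{\text{ssd}}$. The summary does not specify these choices, which outputs feed each term, or any temperature, so the function names, argument names, and defaults here are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_sum for z in scaled]


def cross_entropy(logits, label):
    """Task loss: negative log-probability of the true class."""
    return -log_softmax(logits)[label]


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened distributions (assumed form)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def tssd_total_loss(student_logits, temporal_teacher_logits, weak_logits,
                    label, alpha=1.0, beta=1.0, temperature=1.0):
    """L_total = L_task + alpha * L_tsd + beta * L_ssd (illustrative only)."""
    # Assumption: the task loss is applied to the T_s-timestep student output.
    task = cross_entropy(student_logits, label)
    # Temporal term: the T_t-timestep output teaches the T_s-timestep output.
    tsd = kl_divergence(temporal_teacher_logits, student_logits, temperature)
    # Spatial term: the final output teaches the intermediate weak classifier.
    # Assumption: the student output plays the role of the "final output".
    ssd = kl_divergence(student_logits, weak_logits, temperature)
    return task + alpha * tsd + beta * ssd


loss = tssd_total_loss(
    student_logits=[2.0, 0.5, -1.0],
    temporal_teacher_logits=[2.5, 0.2, -1.2],
    weak_logits=[1.0, 0.8, 0.0],
    label=0,
    alpha=1.0,
    beta=0.5,
)
print(f"total loss: {loss:.4f}")
```

In a real training loop the teacher-side logits would typically be detached from the gradient graph so that only the student side is pulled toward the soft labels; that detail is also an assumption rather than something stated here.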

Abstract

Spiking neural networks (SNNs) have attracted considerable attention for their event-driven, low-power characteristics and high biological interpretability. Inspired by knowledge distillation (KD), recent research has improved the performance of SNN models with a pre-trained teacher model. However, additional teacher models require significant computational resources, and it is tedious to manually define an appropriate teacher network architecture. In this paper, we explore cost-effective self-distillation learning of SNNs to circumvent these concerns. Without an explicitly defined teacher, the SNN generates pseudo-labels and learns consistency during training. On the one hand, we extend the timestep of the SNN during training to create an implicit temporal "teacher" that guides the learning of the original "student", i.e., temporal self-distillation. On the other hand, we guide the output of the weak classifier at the intermediate stage by the final output of the SNN, i.e., spatial self-distillation. Our temporal-spatial self-distillation (TSSD) learning method does not introduce any inference overhead and has excellent generalization ability. Extensive experiments on the static image datasets CIFAR10/100 and ImageNet as well as the neuromorphic datasets CIFAR10-DVS and DVS-Gesture validate the superior performance of the TSSD method. This paper presents a novel way of fusing SNNs with KD, providing insights into high-performance SNN learning methods.
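One way to picture the implicit temporal teacher is as a single forward pass over $T_t$ timesteps from which two outputs are read: the teacher from all $T_t$ steps and the student from only $T_s$ of them. The sketch below assumes that the per-timestep outputs are averaged to form the readout and that the student uses the first $T_s$ steps. Both are assumptions made for illustration, and the helper names are hypothetical.

```python
def mean_over_time(per_step_logits):
    """Average class logits across timesteps (an assumed SNN readout)."""
    steps = len(per_step_logits)
    classes = len(per_step_logits[0])
    return [sum(step[c] for step in per_step_logits) / steps
            for c in range(classes)]


def teacher_student_outputs(per_step_logits, student_steps):
    """Split one T_t-step run into teacher (all steps) and student (first T_s)."""
    teacher_steps = len(per_step_logits)
    if not 0 < student_steps < teacher_steps:
        raise ValueError("need 0 < T_s < T_t")
    teacher = mean_over_time(per_step_logits)
    student = mean_over_time(per_step_logits[:student_steps])
    return teacher, student


# Toy run with T_t = 4 timesteps and 2 classes; the student uses T_s = 2.
per_step = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 2.0]]
teacher, student = teacher_student_outputs(per_step, student_steps=2)
print("teacher:", teacher)
print("student:", student)
```

Because the student output depends only on the first $T_s$ steps, inference can stop after $T_s$ timesteps, which matches the claim that the method adds no inference overhead.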

Paper Structure (24 sections, 15 equations, 6 figures, 8 tables, 1 algorithm)

Figures (6)

  • Figure 1: Comparison of distillation methods. Our self-distillation learning removes the need for the additional teacher model that vanilla distillation requires, thus eliminating significant overhead.
  • Figure 2: Illustration of the TSSD method. $T_s$ is the inference timestep; the training timestep is extended to $T_t>T_s$. The SNN with $T_t$ timesteps acts as the "teacher" and guides the training of the "student" SNN, which uses the first $T_s$ timesteps. Inference takes only $T_s$ timesteps with no additional overhead. From the spatial perspective, the final output of the SNN serves as the teacher to guide the weak output produced by the weak classifier. The weak classifier is discarded during inference.
  • Figure 3: Influence of $T_t$. The overall performance of the TSSD method improves as $T_t$ increases.
  • Figure 4: Influence of $\alpha$ and $\beta$. Our TSSD method is insensitive to $\alpha$ and $\beta$, and consistently yields much better performance than the baseline.
  • Figure 5: Accuracy curves during training (left: CIFAR10, right: CIFAR10-DVS).
  • ...and 1 more figures