Self-Distillation Learning Based on Temporal-Spatial Consistency for Spiking Neural Networks
Lin Zuo, Yongqi Ding, Mengmeng Jing, Kunshan Yang, Yunqian Yu
TL;DR
This work tackles the overhead and rigidity of traditional knowledge distillation for Spiking Neural Networks (SNNs) by introducing Temporal-Spatial Self-Distillation (TSSD). TSSD decouples training and inference timesteps: during training, the same network run for an extended number of timesteps ($T_t$) acts as a temporal teacher, and the final output guides an intermediate classifier as spatial supervision, while inference stays at the original low-latency timestep ($T_s$) with no extra inference cost. The method optimizes a combined loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha \mathcal{L}_{\text{tsd}} + \beta \mathcal{L}_{\text{ssd}}$, enabling the student to learn from non-deterministic soft labels in both temporal and spatial dimensions. Experiments on CIFAR10/100, ImageNet, CIFAR10-DVS, and DVS-Gesture show consistent gains over baselines and competitive or superior performance at low timesteps, supporting the approach's suitability for energy-efficient, real-time SNN deployments.
Abstract
Spiking neural networks (SNNs) have attracted considerable attention for their event-driven, low-power characteristics and high biological interpretability. Inspired by knowledge distillation (KD), recent research has improved the performance of SNN models with the help of a pre-trained teacher model. However, additional teacher models require significant computational resources, and it is tedious to manually define an appropriate teacher network architecture. In this paper, we explore cost-effective self-distillation learning of SNNs to circumvent these concerns. Without an explicitly defined teacher, the SNN generates pseudo-labels and learns consistency during training. On the one hand, we extend the timestep of the SNN during training to create an implicit temporal "teacher" that guides the learning of the original "student", i.e., temporal self-distillation. On the other hand, we use the final output of the SNN to guide the output of the weak classifier at the intermediate stage, i.e., spatial self-distillation. Our temporal-spatial self-distillation (TSSD) learning method does not introduce any inference overhead and has excellent generalization ability. Extensive experiments on the static image datasets CIFAR10/100 and ImageNet as well as the neuromorphic datasets CIFAR10-DVS and DVS-Gesture validate the superior performance of the TSSD method. This paper presents a novel way of fusing SNNs with KD, providing insights into high-performance SNN learning methods.
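To make the combined objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha \mathcal{L}_{\text{tsd}} + \beta \mathcal{L}_{\text{ssd}}$ concrete, the following is a minimal single-sample sketch in plain Python. It is not the authors' implementation. The text above does not specify the form of the two distillation terms, so the sketch assumes that both are KL divergences between temperature-softened softmax distributions, that the task loss is cross-entropy, and that the student's own final output serves as the spatial target. The function names, the temperature `tau`, and the default values of `alpha` and `beta` are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau (tau is an assumption)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Task loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def kl_divergence(target_probs, pred_probs):
    """KL(target || pred); terms with zero target probability contribute 0."""
    return sum(p * math.log(p / q)
               for p, q in zip(target_probs, pred_probs) if p > 0)


def tssd_loss(student_logits, teacher_logits, intermediate_logits, label,
              alpha=1.0, beta=1.0, tau=1.0):
    """Hypothetical sketch of L_total = L_task + alpha * L_tsd + beta * L_ssd.

    student_logits:      output of the SNN run for T_s timesteps
    teacher_logits:      output of the same SNN run for T_t > T_s timesteps
    intermediate_logits: output of the weak intermediate classifier
    In a real training loop the target distributions would be treated as
    constants (no gradient flows through them).
    """
    l_task = cross_entropy(student_logits, label)
    # Temporal self-distillation: the longer-timestep output is the target.
    l_tsd = kl_divergence(softmax(teacher_logits, tau),
                          softmax(student_logits, tau))
    # Spatial self-distillation: the final output guides the weak classifier
    # (using the student's final output here is an assumption).
    l_ssd = kl_divergence(softmax(student_logits, tau),
                          softmax(intermediate_logits, tau))
    return l_task + alpha * l_tsd + beta * l_ssd
```

Under these assumptions, both distillation terms vanish when the three outputs agree, so the objective reduces to the task loss. Only the forward pass at $T_s$ is needed at inference, which is consistent with the claim that TSSD adds no inference overhead.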
