Synergy Between the Strong and the Weak: Spiking Neural Networks are Inherently Self-Distillers

Yongqi Ding, Lin Zuo, Mengmeng Jing, Kunshan Yang, Pei He, Tonglan Xie

TL;DR

This work addresses the performance gap in brain-inspired spiking neural networks by proposing self-distillation through temporal deconstruction: treating each timestep as a submodel and using per-timestep confidence to designate strong and weak roles. It introduces two schemes, Strong2Weak and Weak2Strong, with flexible implementations (ensemble, simultaneous, cascade) that distill knowledge without external teachers or added overhead, leveraging the dark knowledge across timesteps. Empirical results across static and neuromorphic datasets show consistent gains in accuracy, robustness, and low-latency performance, and demonstrate compatibility with early-exit inference. The method offers a scalable, resource-efficient route to high-performance SNNs with practical applicability to temporal tasks and edge compute scenarios.

Abstract

Brain-inspired spiking neural networks (SNNs) promise to be a low-power alternative to computationally intensive artificial neural networks (ANNs), although performance gaps persist. Recent studies have improved the performance of SNNs through knowledge distillation, but they rely on large teacher models or introduce additional training overhead. In this paper, we show that SNNs can be naturally deconstructed into multiple submodels for efficient self-distillation. We treat each timestep instance of the SNN as a submodel and evaluate its output confidence, thus efficiently identifying the strong and the weak. Based on this strong and weak relationship, we propose two efficient self-distillation schemes: (1) Strong2Weak: During training, the stronger "teacher" guides the weaker "student", effectively improving overall performance. (2) Weak2Strong: The weak serve as the "teacher", distilling the strong in reverse with underlying dark knowledge, again yielding significant performance gains. For both distillation schemes, we offer flexible implementations such as ensemble, simultaneous, and cascade distillation. Experiments show that our method effectively improves the discriminability and overall performance of the SNN, while its adversarial robustness is also enhanced, benefiting from the stability brought by self-distillation. This approach exploits the temporal properties of SNNs and provides insight into how to efficiently train high-performance SNNs.
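The strong/weak selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that confidence is measured as the maximum softmax probability of a timestep's output, that the distillation term is a temperature-scaled KL divergence between one teacher and one student timestep, and that a single sample is processed at a time. The function names (`confidence`, `pick_strong_and_weak`, `self_distillation_loss`) and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    # Assumption: confidence is the maximum softmax probability.
    # The paper may define confidence differently.
    return max(softmax(logits))


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


def pick_strong_and_weak(per_timestep_logits):
    """Return (strong_index, weak_index) among the timestep submodels."""
    scores = [confidence(logits) for logits in per_timestep_logits]
    indices = range(len(scores))
    strong = max(indices, key=scores.__getitem__)
    weak = min(indices, key=scores.__getitem__)
    return strong, weak


def self_distillation_loss(per_timestep_logits, direction="strong2weak",
                           temperature=2.0):
    """Distillation term between the strongest and weakest timestep.

    direction="strong2weak": the strong timestep teaches the weak one.
    direction="weak2strong": the weak timestep teaches the strong one.
    In real training the teacher output would be treated as a constant
    (no gradient); that detail is outside this pure-Python sketch.
    """
    strong, weak = pick_strong_and_weak(per_timestep_logits)
    if direction == "strong2weak":
        teacher, student = strong, weak
    else:
        teacher, student = weak, strong
    teacher_probs = softmax(per_timestep_logits[teacher], temperature)
    student_probs = softmax(per_timestep_logits[student], temperature)
    return temperature ** 2 * kl_divergence(teacher_probs, student_probs)


# Hypothetical outputs of one sample at T = 3 timesteps, 3 classes.
outputs = [
    [0.2, 0.1, 0.0],   # t = 0: nearly uniform, low confidence
    [2.0, 0.1, -1.0],  # t = 1: peaked, high confidence
    [1.0, 0.5, 0.0],   # t = 2: in between
]
strong_t, weak_t = pick_strong_and_weak(outputs)
s2w = self_distillation_loss(outputs, "strong2weak")
w2s = self_distillation_loss(outputs, "weak2strong")
```

In practice such a term would be added to the ordinary task loss and computed per sample over a batch. Because every timestep shares the same architecture and parameters, the only extra work in this sketch is comparing outputs that the forward pass already produces.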

Paper Structure

This paper contains 34 sections, 10 equations, 5 figures, 18 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison to other distillation methods. Our self-distillation deconstructs an SNN into multiple submodels and identifies the strong and weak ones for self-distillation without any additional overhead.
  • Figure 2: Our method logically deconstructs the SNN with T timesteps into T submodels with the same architecture and parameters, and evaluates the confidence level of each submodel to identify the strong and the weak. The strong help the weak through distillation, and the weak transfer underlying dark knowledge to the strong, thus improving overall performance.
  • Figure 3: Visualization of the output distribution of each timestep submodel. (a) The vanilla SNN produces confusing outputs at $t=0$, showing dramatic gaps between the strong and the weak. (b, c) Both Strong2Weak and Weak2Strong improve the output discriminability of the submodels, thus bridging the gap between strong and weak to improve the overall stability and performance.
  • Figure 4: Visualization of the overall output of the vanilla SNN, Strong2Weak distillation, and Weak2Strong distillation. Strong2Weak and Weak2Strong distillation schemes provide superior performance by allowing for more discriminable outputs than the vanilla SNN.
  • Figure 5: The evolution trend of timestep indices corresponding to strong and weak submodels during training.