Learning Scalable Temporal Representations in Spiking Neural Networks Without Labels
Chengwei Zhou, Gourav Datta
TL;DR
This work tackles the challenge of scalable self-supervised learning for spiking neural networks by introducing MixedLIF, a dual-path neuron that lets gradients flow through a differentiable surrogate while preserving a fully spiking path for inference. It uses two temporally aware losses, Cross Temporal Loss and Boundary Temporal Loss, to exploit spike timing and align representations across augmented views. The approach enables ImageNet-scale pretraining and strong transfer to classification, detection, and segmentation, reaching 70.1% top-1 accuracy on ImageNet-1K with Spikformer-16-512 and competitive results against non-spiking SSL baselines, while offering substantial energy efficiency during inference. Overall, the paper demonstrates that unlabeled learning in high-capacity SNNs is feasible at modern scale and provides a practical path toward scalable, hardware-friendly self-supervised neuromorphic learning.
Abstract
Spiking neural networks (SNNs) exhibit temporal, sparse, and event-driven dynamics that make them appealing for efficient inference. However, extending these models to self-supervised regimes remains challenging because the discontinuities introduced by spikes break the cross-view gradient correspondences required by contrastive and consistency-driven objectives. This work introduces a training paradigm that enables large SNN architectures to be optimized without labeled data. We formulate a dual-path neuron in which a spike-generating process is paired with a differentiable surrogate branch, allowing gradients to propagate across augmented inputs while preserving a fully spiking implementation at inference. In addition, we propose temporal alignment objectives that enforce representational coherence both across spike timesteps and between augmented views. Using convolutional and transformer-style SNN backbones, we demonstrate ImageNet-scale self-supervised pretraining and strong transfer to classification, detection, and segmentation benchmarks. Our best model, a fully self-supervised Spikformer-16-512, achieves 70.1% top-1 accuracy on ImageNet-1K, demonstrating that unlabeled learning in high-capacity SNNs is feasible at modern scale.
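The dual-path idea can be illustrated with a minimal sketch. This is not the paper's MixedLIF definition, which this section does not give. It assumes a standard leaky integrate-and-fire update with a hard reset and a sigmoid relaxation of the threshold test. The function name `dual_path_lif` and the parameters `threshold`, `decay`, and `slope` are hypothetical, chosen only for illustration.

```python
import math


def dual_path_lif(inputs, threshold=1.0, decay=0.5, slope=4.0):
    """Run one leaky integrate-and-fire neuron over a sequence of input currents.

    Returns two lists of equal length:
      spikes    -- binary outputs from the hard threshold (the spiking path)
      surrogate -- sigmoid relaxation of the same threshold test
                   (a smooth path that gradients could flow through)

    All names and defaults are illustrative assumptions, not values from the paper.
    """
    v = 0.0  # membrane potential
    spikes, surrogate = [], []
    for x in inputs:
        # Leaky integration: decay the previous potential, then add the input.
        v = decay * v + x
        # Surrogate path: smooth, differentiable function of the potential.
        soft = 1.0 / (1.0 + math.exp(-slope * (v - threshold)))
        # Spiking path: non-differentiable Heaviside step on the same potential.
        hard = 1.0 if v >= threshold else 0.0
        spikes.append(hard)
        surrogate.append(soft)
        # Hard reset: the potential returns to zero after a spike.
        v = v * (1.0 - hard)
    return spikes, surrogate


if __name__ == "__main__":
    spikes, surrogate = dual_path_lif([0.6, 0.6, 0.6, 0.1])
    print("spikes:   ", spikes)
    print("surrogate:", [round(s, 3) for s in surrogate])
```

In an automatic-differentiation framework, a training loss would plausibly be computed on the surrogate outputs while inference uses only the binary spikes. How the two paths are actually combined in MixedLIF is specified in the paper, not here.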
