Learning Scalable Temporal Representations in Spiking Neural Networks Without Labels

Chengwei Zhou, Gourav Datta

TL;DR

This work tackles scalable self-supervised learning for spiking neural networks by introducing MixedLIF, a dual-path neuron that lets gradients flow through a differentiable surrogate while preserving a fully spiking path for inference. It adds two temporally aware losses, Cross Temporal Loss and Boundary Temporal Loss, which exploit spike timing and align representations across augmented views. The approach enables ImageNet-scale pretraining and strong transfer to classification, detection, and segmentation, reaching 70.1% top-1 accuracy on ImageNet-1K with Spikformer-16-512 and competitive results against non-spiking SSL baselines, while offering substantial energy efficiency during inference. Overall, the paper demonstrates that unlabeled learning in high-capacity SNNs is feasible at modern scale and provides a practical path toward scalable, hardware-friendly self-supervised neuromorphic learning.

Abstract

Spiking neural networks (SNNs) exhibit temporal, sparse, and event-driven dynamics that make them appealing for efficient inference. However, extending these models to self-supervised regimes remains challenging because the discontinuities introduced by spikes break the cross-view gradient correspondences required by contrastive and consistency-driven objectives. This work introduces a training paradigm that enables large SNN architectures to be optimized without labeled data. We formulate a dual-path neuron in which a spike-generating process is paired with a differentiable surrogate branch, allowing gradients to propagate across augmented inputs while preserving a fully spiking implementation at inference. In addition, we propose temporal alignment objectives that enforce representational coherence both across spike timesteps and between augmented views. Using convolutional and transformer-style SNN backbones, we demonstrate ImageNet-scale self-supervised pretraining and strong transfer to classification, detection, and segmentation benchmarks. Our best model, a fully self-supervised Spikformer-16-512, achieves 70.1% top-1 accuracy on ImageNet-1K, demonstrating that unlabeled learning in high-capacity SNNs is feasible at modern scale.

Paper Structure

This paper contains 29 sections, 12 equations, 6 figures, 15 tables, 2 algorithms.

Figures (6)

  • Figure 1: Left: Core components of our weight-shared dual-path SNN architecture. Both paths share the same trainable parameters but differ in activations via the MixedLIF neuron: Path $A$ emits spikes using a standard LIF neuron, while Path $B$ uses the SG-antiderivative neuron (the antiderivative of the surrogate-gradient function), so that parameter updates use true gradients. Right: Visualization of the MixedLIF neuron’s dynamics, including input current $H[t]$, output spikes $O[t]$, post-spike membrane potential $V[t]$, and the surrogate gradient used for training.
  • Figure 2: Overview of our self-supervised training framework. Two independently distorted and time-augmented views, $X^A$ and $X^B$, are passed through separate processing paths $A$ and $B$. Path $A$ uses Leaky Integrate-and-Fire (LIF) neurons to generate discrete spike outputs $Z^A$, while path $B$ employs the antiderivative of the LIF surrogate function used in path $A$, yielding range-continuous outputs $Z^B$. Both sequences are then projected and compared via Cross Temporal or Boundary Temporal Loss to encourage temporally consistent and invariant representations.
  • Figure 3: Top-1 accuracies (%) and inference latencies for different time steps under linear evaluation on CIFAR-10.
  • Figure 4: Training time comparison with Spiking-ResNet34 on CIFAR-10 between different loss functions. All values are normalized to the non-cross temporal loss.
  • Figure 5: Layer-wise spiking rates over 4 time steps for Spiking-ResNet34 (top) and Spikformer-4-384 (bottom). Each cluster of four bars corresponds to one layer, with bar colors indicating time steps 1–4. The dashed horizontal line marks the average spiking rate across all layers and time steps.
  • ...and 1 more figures
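
The dual-path idea described in the captions of Figures 1 and 2 can be illustrated with a minimal single-neuron sketch. This is not the paper's implementation. The function name `mixed_lif_sketch`, the leak constant `tau`, the threshold `v_th`, the sharpness `alpha`, the choice of a sigmoid as the antiderivative of a sigmoid-derivative surrogate gradient, and the soft reset on the continuous path are all assumptions made for illustration.

```python
import math


def mixed_lif_sketch(inputs, tau=2.0, v_th=1.0, alpha=4.0):
    """Run one neuron over a sequence of input values along two paths.

    Path A: standard LIF with a hard threshold, emitting binary spikes.
    Path B: replaces the threshold with a sigmoid, taken here as the
            antiderivative of an assumed sigmoid-derivative surrogate
            gradient, so its output is continuous in (0, 1).
    All hyperparameters are hypothetical defaults, not values from the paper.
    """
    v_a = 0.0  # membrane potential, spiking path
    v_b = 0.0  # membrane potential, continuous path
    spikes, soft = [], []
    for x in inputs:
        # Path A: leaky integration, Heaviside firing, hard reset to 0.
        u_a = v_a + (x - v_a) / tau
        s = 1.0 if u_a >= v_th else 0.0
        v_a = u_a * (1.0 - s)
        spikes.append(s)

        # Path B: same integration rule, smooth firing function.
        u_b = v_b + (x - v_b) / tau
        y = 1.0 / (1.0 + math.exp(-alpha * (u_b - v_th)))
        v_b = u_b * (1.0 - y)  # soft reset (an assumption of this sketch)
        soft.append(y)
    return spikes, soft


if __name__ == "__main__":
    spikes, soft = mixed_lif_sketch([1.5, 1.5, 1.5, 0.0])
    print("Path A spikes:", spikes)
    print("Path B outputs:", [round(y, 3) for y in soft])
```

In this sketch both paths use the same integration rule, which mirrors the weight sharing described in Figure 1: only the firing function differs, so Path A stays binary for inference while Path B remains differentiable for training.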