PredNext: Explicit Cross-View Temporal Prediction for Unsupervised Learning in Spiking Neural Networks

Yiting Dong, Jianhao Ding, Zijie Xu, Tong Bu, Zhaofei Yu, Tiejun Huang

Abstract

Spiking Neural Networks (SNNs), with their temporal processing capabilities and biologically plausible dynamics, offer a natural platform for unsupervised representation learning. However, current unsupervised SNNs predominantly employ shallow architectures or localized plasticity rules, limiting their ability to model long-range temporal dependencies and maintain temporal feature consistency. This results in semantically unstable representations, impeding the development of deep unsupervised SNNs for large-scale temporal video data. We propose PredNext, which explicitly models temporal relationships through cross-view future Step Prediction and Clip Prediction. This plug-and-play module integrates with diverse self-supervised objectives. We are the first to establish standard benchmarks for SNN self-supervised learning on UCF101, HMDB51, and MiniKinetics, which are substantially larger than conventional DVS datasets. PredNext delivers significant performance improvements across different tasks and self-supervised methods. With unsupervised training solely on UCF101, PredNext achieves performance comparable to ImageNet-pretrained supervised weights. Additional experiments demonstrate that PredNext, unlike forced consistency constraints, substantially improves temporal feature consistency while enhancing network generalization. This work provides an effective foundation for unsupervised deep SNNs on large-scale temporal video data.

Paper Structure

This paper contains 24 sections, 9 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Analysis of temporal consistency. (a) Evolution of inter-frame feature similarity during SNN training. (b) Distribution of video features in high-dimensional space, demonstrating more concentrated clustering for high-consistency temporal representations. Blue points represent features from different timesteps of the same video, while red points indicate cluster centers in nearby feature space locations. Green and red arrows denote intra-video feature attraction across frames and inter-video feature repulsion respectively.
  • Figure 2: PredNext algorithmic framework. PredNext incorporates Step Prediction and Clip Prediction components, which predict features at the next step and in subsequently sampled clips from the same video, respectively. As an auxiliary module, PredNext can be integrated into existing self-supervised learning methods. Red arrows indicate the Step Prediction pathway, while blue arrows denote the Clip Prediction pathway.
  • Figure 3: Implementation of self-supervised learning in SNNs, covering SimCLR, MoCo, SimSiam, BYOL, and BarlowTwins. Temporal features are aggregated after the SNN encoder.
  • Figure 4: Analysis of temporal feature visualization. Top row: evolution of temporal consistency error during training across methods. Middle and bottom rows: UMAP visualizations of video features from baseline self-supervised methods and their PredNext-enhanced variants, respectively.
  • Figure 5: Visualization of retrieval results. Query videos (left) with their corresponding Top-3 retrieval results. Results are shown for three query samples, one per row.
  • ...and 3 more figures
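
The captions for Figures 1 and 2 mention two quantities that a small sketch may help make concrete: inter-frame feature similarity, and a cross-view Step Prediction objective. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names, the single-matrix predictor, and the use of negative cosine similarity as the loss are all hypothetical choices, and Clip Prediction is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def interframe_similarity(feats):
    """Mean cosine similarity between features at adjacent timesteps."""
    pairs = list(zip(feats[:-1], feats[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def linear_predictor(weight, x):
    """Hypothetical predictor head: a single matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def step_prediction_loss(view_a, view_b, weight):
    """Predict view_b's feature at step t+1 from view_a's feature at step t.

    Assumed loss (not taken from the paper): negative cosine similarity,
    averaged over all steps t that have a successor.
    """
    losses = []
    for t in range(len(view_a) - 1):
        pred = linear_predictor(weight, view_a[t])
        losses.append(-cosine(pred, view_b[t + 1]))
    return sum(losses) / len(losses)


# Toy per-timestep features (T=3, D=2) for two augmented views of one video.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rotate_90 = [[0.0, -1.0], [1.0, 0.0]]  # maps each toy feature onto the next one
identity = [[1.0, 0.0], [0.0, 1.0]]    # ignores temporal change entirely

loss_good = step_prediction_loss(view_a, view_b, rotate_90)
loss_bad = step_prediction_loss(view_a, view_b, identity)
```

In this toy setup the rotation predictor reaches a lower loss than the identity predictor, because it captures how the features change from one step to the next. In a real training loop the predictor weights would be learned jointly with the SNN encoder, which this sketch leaves out.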