Unleashing Temporal Capacity of Spiking Neural Networks through Spatiotemporal Separation

Yiting Dong, Zhaofei Yu, Jianhao Ding, Zijie Xu, Tiejun Huang

TL;DR

This work questions the convention that membrane potential propagation is universally essential for temporal processing in spiking neural networks. Through non-stateful ablations, it uncovers a non-monotonic relationship between temporal state and performance, attributed to spatio-temporal resource competition. To address this, the authors propose STSep, a Spatial-Temporal Separable Network that decouples spatial semantics from temporal dynamics using a stateless spatial path and a differencing-based temporal path, fused with a controlled balance. STSep achieves state-of-the-art results on Something-Something V2, UCF101, and HMDB51, with ablations and visualizations showing motion-focused representations and robust retrieval performance, offering a practical architectural paradigm for efficient spatiotemporal modeling in video understanding.

Abstract

Spiking Neural Networks (SNNs) are considered naturally suited for temporal processing, with membrane potential propagation widely regarded as the core temporal modeling mechanism. However, existing research lacks an analysis of its actual contribution to complex temporal tasks. We design Non-Stateful (NS) models that progressively remove membrane potential propagation to quantify its stage-wise role. Experiments reveal a counterintuitive phenomenon: moderate removal in shallow or deep layers improves performance, while excessive removal causes collapse. We attribute this to spatio-temporal resource competition, where neurons encode both semantics and dynamics within a limited range, so that temporal state consumes capacity needed for spatial learning. Based on this, we propose the Spatial-Temporal Separable Network (STSep), which decouples residual blocks into independent spatial and temporal branches. The spatial branch focuses on semantic extraction, while the temporal branch captures motion through explicit temporal differences. Experiments on Something-Something V2, UCF101, and HMDB51 show that STSep achieves superior performance, with retrieval tasks and attention analysis confirming a focus on motion rather than static appearance. This work provides new perspectives on SNNs' temporal mechanisms and an effective solution for spatiotemporal modeling in video understanding.
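
To make the distinction in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard leaky integrate-and-fire (LIF) update with a hard reset, and it assumes that "non-stateful" means the membrane potential starts from zero at every time step. The function names (`lif_stateful`, `lif_non_stateful`, `temporal_difference`) and the parameters `tau` and `v_th` are hypothetical, and the frame-to-frame subtraction is only one plausible reading of "explicit temporal differences".

```python
from typing import List


def lif_stateful(inputs: List[float], tau: float = 2.0, v_th: float = 1.0) -> List[int]:
    """LIF neuron whose membrane potential propagates across time steps."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = v + (x - v) / tau          # leaky integration of the input current
        fired = 1 if v >= v_th else 0
        spikes.append(fired)
        if fired:
            v = 0.0                    # hard reset after a spike (assumption)
    return spikes


def lif_non_stateful(inputs: List[float], tau: float = 2.0, v_th: float = 1.0) -> List[int]:
    """Assumed non-stateful variant: the potential starts from 0 at every step."""
    spikes = []
    for x in inputs:
        v = x / tau                    # same update as above with v = 0, no carry-over
        spikes.append(1 if v >= v_th else 0)
    return spikes


def temporal_difference(frames: List[List[int]]) -> List[List[int]]:
    """Assumed temporal-branch input: frame[t] - frame[t-1], zeros for the first step."""
    diffs = [[0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, cur)])
    return diffs


if __name__ == "__main__":
    current = [1.2, 1.2, 1.2, 1.2]
    print(lif_stateful(current))       # integrates over time before firing
    print(lif_non_stateful(current))   # each step is judged in isolation
    print(temporal_difference([[0, 0], [1, 0], [1, 1]]))
```

With a constant sub-threshold input, the stateful neuron fires only after accumulating potential over several steps, whereas the non-stateful neuron never fires. Under these assumptions, a stateless path therefore carries no temporal information on its own, which is why the sketch supplies motion separately through differences.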

Paper Structure

This paper contains 28 sections, 9 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: The spatiotemporal separable network. Separating residual blocks into non-stateful spatial and temporal branches.
  • Figure 2: Illustration of NS models, demonstrating forward (NS) and reverse (rNS) strategies for progressive removal of temporal state across layers.
  • Figure 3: Detailed illustration of the STSep module, where NSN denotes Non-Stateful Neuron. DS/DC scale represents downsampling and channel scaling operations.
  • Figure 4: Comparison of stage-wise ablation. The x-axis represents different stages of separation, while the y-axis shows Top-1 accuracy. (a) NS/rNS model comparison with various configurations. (b) STSep model comparison in state vs. nostate.
  • Figure 5: Activation visualization. Top shows vanilla model activation for sample inputs. Bottom shows STSep model activation for the same samples.
  • ...and 4 more figures