
S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang

Abstract

Skeleton-based action recognition is crucial for multimedia applications, but it relies heavily on power-hungry Artificial Neural Networks (ANNs), which limits its deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency-domain transformations. They also suffer severely from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion modules, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while reducing theoretical energy consumption compared to classic ANNs, establishing a new state of the art for energy-efficient neuromorphic action recognition.
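
One way to read the "kinematic differential operator" idea is as finite differences over a skeleton sequence: an identity stream (the raw joints), a spatial stream (each joint minus its parent), and a temporal stream (each frame minus the previous one). The sketch below illustrates only that reading and is not the authors' M-ASE. The 5-joint parent list, the function names `kinematic_streams` and `to_events`, and the threshold-based event encoding are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton: PARENTS[j] is the parent of joint j
# (the root points to itself). This is not the paper's joint layout.
PARENTS = np.array([0, 0, 1, 1, 3])

def kinematic_streams(x):
    """x: (T, V, C) joint coordinates -> identity, spatial, temporal streams."""
    identity = x
    spatial = x - x[:, PARENTS, :]      # bone vectors: joint minus its parent
    temporal = np.zeros_like(x)
    temporal[1:] = x[1:] - x[:-1]       # frame-to-frame motion; zero at t = 0
    return identity, spatial, temporal

def to_events(stream, theta=0.05):
    """Assumed encoding: emit a binary event where |value| exceeds theta."""
    return (np.abs(stream) > theta).astype(np.uint8)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 3))          # 8 frames, 5 joints, 3-D coordinates
x[4:] = x[3]                            # the skeleton freezes after frame 3
identity, spatial, temporal = kinematic_streams(x)
events = to_events(temporal)            # no events once the skeleton is still
```

Under these assumptions, a motionless skeleton produces no temporal events at all, which gives some intuition for why difference-based streams tend to be sparse.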

Paper Structure (32 sections, 12 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Overall architecture of the proposed Spiking State-Space Topology Transformer (S3T-Former). The input skeletal coordinates are first processed by the Multi-Stream Anatomical Spiking Embedding (M-ASE) to generate rich, multi-order (identity, spatial, and temporal) event streams. The core S3T Block employs a Spiking State-Space Topology Attention (S3T-Attn) module and a Spiking MLP. A non-spiking IF Integrator ($U_{th}=\infty$) serves as a lossless membrane potential readout, optimized directly with Temporal Efficient Training (TET) loss.
  • Figure 2: Lateral Spiking Topology Routing (LSTR). It decouples spatial anatomy into multi-head pathways, executing zero-MAC spatial feature broadcasts via conditional sparse additions.
  • Figure 3: Spiking State-Space (S3) Engine. It constructs a linear-complexity temporal memory pool to integrate global spatio-temporal context without dense multiplications.
  • Figure 4: Visualization of the dynamic topology matrix $\mathbf{A}_{dyn}^{(h)}$ (Eq. \ref{eq:dynamic_topology}) across different network depths for head $h=8$.
  • Figure 5: Target vs. Competitor dynamics. U-Readout avoids the staircase effect of Spike Counting.
  • ...and 5 more figures
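
The captions for Figures 1 and 5 contrast a membrane-potential readout ($U_{th}=\infty$) with spike counting. The sketch below illustrates that contrast on a constant input: an integrate-and-fire (IF) neuron with a finite threshold yields an integer spike count that moves in steps, while an integrator that never fires exposes the accumulated potential directly. The soft-reset rule, the threshold of 1.0, the input value, and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def if_spike_count(currents, u_th=1.0):
    """IF neuron with soft reset (assumed); returns the cumulative spike count."""
    u, count, trace = 0.0, 0, []
    for i in currents:
        u += i                 # integrate the input current
        if u >= u_th:          # fire once the threshold is reached
            u -= u_th          # soft reset: subtract the threshold
            count += 1
        trace.append(count)
    return np.array(trace)

def membrane_readout(currents):
    """IF integrator with U_th = infinity: it never fires, so U only accumulates."""
    return np.cumsum(currents)

currents = np.full(10, 0.25)             # constant input, exact in binary floats
counts = if_spike_count(currents)        # [0 0 0 1 1 1 1 2 2 2]
potential = membrane_readout(currents)   # 0.25, 0.5, ..., 2.5
```

In this toy setting the spike count takes only three distinct values over ten steps, whereas the potential takes ten, which is the quantization that a staircase effect refers to.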