S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng; Hailun Xia; Zepeng Sun; Weiyi Li; Yujia Wang

Abstract

Skeleton-based action recognition is crucial for multimedia applications but relies heavily on power-hungry Artificial Neural Networks (ANNs), which limits deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency-domain transformations. Furthermore, they suffer severely from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while reducing theoretical energy consumption relative to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.
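The sparsity argument in the abstract rests on a general property of spike-driven computation: when inputs are binary spikes, a weighted sum reduces to adding the weight rows of the inputs that fired, so silent inputs cost nothing and no multiplications are needed. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation; the function names, the weight values, and the operation counters are hypothetical and chosen only for illustration.

```python
def dense_matvec(weights, x):
    """Dense baseline: one multiply-accumulate (MAC) per weight entry."""
    n_out = len(weights[0])
    out = [0.0] * n_out
    macs = 0
    for i, xi in enumerate(x):
        for j in range(n_out):
            out[j] += xi * weights[i][j]
            macs += 1
    return out, macs


def spike_driven_matvec(weights, spikes):
    """Spike-driven variant: add the weight row of each firing input only."""
    n_out = len(weights[0])
    out = [0.0] * n_out
    adds = 0
    for i, s in enumerate(spikes):
        if s:  # silent inputs are skipped entirely
            for j in range(n_out):
                out[j] += weights[i][j]
                adds += 1
    return out, adds


# Hypothetical 4-input, 2-output layer with a 50% sparse spike vector.
weights = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5], [1.0, 1.0]]
spikes = [1, 0, 0, 1]

dense_out, macs = dense_matvec(weights, spikes)
sparse_out, adds = spike_driven_matvec(weights, spikes)
print(dense_out, macs)   # same result, one MAC per weight
print(sparse_out, adds)  # same result, additions only for firing inputs
```

Both functions return the same output vector; the spike-driven version performs additions only, and their number scales with the firing rate rather than with the layer size.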

Paper Structure (32 sections, 12 equations, 10 figures, 5 tables)

Introduction
Related Work
Skeleton-based Action Recognition
Spiking Neural Networks
State-Space Models and Temporal Dynamics
Method
Preliminaries: Spiking Neuron Dynamics
Generalized Kinematic Differential Operator: M-ASE
Spiking State-Space Topology Block (S3T Block)
Asymmetric Temporal-Gradient QKV (ATG-QKV)
Spiking State-Space Topology Attention
Overall Architecture and Membrane Readout
Experiments
Comparison with State-of-the-Art Architectures
Ablation Studies
...and 17 more sections

Figures (10)

Figure 1: Overall architecture of the proposed Spiking State-Space Topology Transformer (S3T-Former). The input skeletal coordinates are first processed by the Multi-Stream Anatomical Spiking Embedding (M-ASE) to generate rich, multi-order (identity, spatial, and temporal) event streams. The core S3T Block employs a Spiking State-Space Topology Attention (S3T-Attn) module and a Spiking MLP. A non-spiking IF Integrator ($U_{th}=\infty$) serves as a lossless membrane potential readout, optimized directly with Temporal Efficient Training (TET) loss.
Figure 2: Lateral Spiking Topology Routing (LSTR). It decouples spatial anatomy into multi-head pathways, executing zero-MAC spatial feature broadcasts via conditional sparse additions.
Figure 3: Spiking State-Space (S3) Engine. It constructs a linear-complexity temporal memory pool to integrate global spatio-temporal context without dense multiplications.
Figure 4: Visualization of the dynamic topology matrix $\mathbf{A}_{dyn}^{(h)}$ (Eq. \ref{eq:dynamic_topology}) across different network depths for head $h=8$.
Figure 5: Target vs. Competitor dynamics. U-Readout avoids the staircase effect of Spike Counting.
...and 5 more figures
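
The captions of Figures 1 and 5 contrast a membrane-potential readout (an IF integrator with $U_{th}=\infty$, which never fires) with spike counting, which is said to produce a staircase effect. The sketch below is one plausible reading of that contrast, not the paper's code: the input currents, the unit threshold, and the soft reset are assumptions made for illustration, and the function names are hypothetical.

```python
def spike_count_readout(currents, threshold=1.0):
    """IF neuron with a finite threshold; the readout is the running spike count."""
    u, count, trace = 0.0, 0, []
    for i in currents:
        u += i                  # integrate the input current
        if u >= threshold:      # fire and soft-reset (assumed reset rule)
            count += 1
            u -= threshold
        trace.append(count)
    return trace


def membrane_readout(currents):
    """IF integrator with an infinite threshold: it never fires,
    so the membrane potential itself serves as the readout."""
    u, trace = 0.0, []
    for i in currents:
        u += i
        trace.append(u)
    return trace


# Hypothetical input currents over five time steps.
currents = [0.5, 0.5, 0.25, 0.75, 0.5]

count_trace = spike_count_readout(currents)
membrane_trace = membrane_readout(currents)
print(count_trace)     # [0, 1, 1, 2, 2]
print(membrane_trace)  # [0.5, 1.0, 1.25, 2.0, 2.5]
```

The spike count moves only in integer steps and discards the sub-threshold residue left at the final step, whereas the membrane trace keeps the full accumulated value at every step.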
