SpikMamba: When SNN meets Mamba in Event-based Human Action Recognition

Jiaqi Chen, Yan Yang, Shizhuo Deng, Da Teng, Liyuan Pan

TL;DR

SpikMamba combines the energy efficiency of spiking neural networks with the long-sequence modeling capability of Mamba to efficiently capture global features from spatially sparse, high-temporal-resolution event data.

Abstract

Human action recognition (HAR) plays a key role in various applications such as video analysis, surveillance, autonomous driving, robotics, and healthcare. Most HAR algorithms are developed from RGB images, which capture detailed visual information. However, these algorithms raise concerns in privacy-sensitive environments because they record identifiable features. Event cameras offer a promising solution by capturing scene brightness changes sparsely at the pixel level, without capturing full images. Moreover, event cameras have a high dynamic range and can effectively handle scenarios with complex lighting conditions, such as low-light or high-contrast environments. However, using event cameras introduces challenges in modeling spatially sparse, high-temporal-resolution event data for HAR. To address these issues, we propose the SpikMamba framework, which combines the energy efficiency of spiking neural networks with the long-sequence modeling capability of Mamba to efficiently capture global features from spatially sparse, high-temporal-resolution event data. Additionally, to improve the locality of modeling, a spiking window-based linear attention mechanism is used. Extensive experiments show that SpikMamba surpasses the previous state of the art by 1.45%, 7.22%, 0.15%, and 3.92% on the PAF, HARDVS, DVS128, and E-FAction datasets, respectively. The code is available at https://github.com/Typistchen/SpikMamba.
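To make the data format concrete, the sketch below shows one common way to turn a sparse event stream into dense frames: events are grouped into equal-width time bins and counted per pixel and polarity. This is an illustrative assumption, not the representation used by SpikMamba. The event layout `(x, y, timestamp, polarity)`, the helper name `events_to_frames`, the integer microsecond timestamps, and the two polarity channels are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, timestamp in microseconds, polarity in {0, 1}).
Event = Tuple[int, int, int, int]


def events_to_frames(events: List[Event], width: int, height: int,
                     num_bins: int) -> List[List[List[List[int]]]]:
    """Count events into frames[bin][polarity][y][x] over equal-width time bins."""
    frames = [[[[0] * width for _ in range(height)] for _ in range(2)]
              for _ in range(num_bins)]
    if not events:
        return frames
    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1  # avoid division by zero for a single timestamp
    for x, y, t, p in events:
        # Map the timestamp to a bin index; clamp so t_end lands in the last bin.
        b = min((t - t_start) * num_bins // span, num_bins - 1)
        frames[b][p][y][x] += 1
    return frames


# Four events on a 2x2 sensor, split into two time bins.
events = [(0, 0, 0, 1), (1, 0, 10_000, 0), (1, 1, 90_000, 1), (1, 1, 100_000, 1)]
frames = events_to_frames(events, width=2, height=2, num_bins=2)
total_events = sum(v for f in frames for ch in f for row in ch for v in row)
```

Most entries of `frames` stay zero, which reflects the spatial sparsity the abstract refers to; increasing `num_bins` preserves more temporal detail at the cost of more frames to process.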

Paper Structure

This paper contains 12 sections, 12 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of (a) ANN methods, (b) SNN methods, and (c) ours for event-based HAR. (a) ANN methods downsample the event data temporally to reduce the heavy computation, and use an attention mechanism, convolutional neural network (CNN), or graph convolutional network (GCN) to extract features from spatially sparse events, achieving high performance. However, fine-grained information about human actions, which could improve model performance, is lost. (b) SNN methods use SpikRNN and SpikGCN to effectively extract features from spatially sparse events. However, the computations are usually restricted to local temporal contexts, which loses the global temporal dependency needed to recognize human actions accurately. (c) We combine Mamba and window-based linear attention into an SNN to efficiently model the global and local temporal dependencies of the event data and accurately recognize human actions.
  • Figure 2: Overview of SpikMamba. We represent event data as three-channel event images, and predict the action class of the event data with two modules: 1) Spiking 3D patch embedding. It divides event frames into patches and projects the patches to spike-form features. 2) SpikMamba blocks. Each block consists of a window-based spike linear attention layer and a spike Mamba layer to model the local and global temporal dependencies of the event data. We show the architecture of the SpikMamba block at the top right, and the architectures of the spike linear attention layer and the spike Mamba layer at the bottom right. Finally, in the prediction layer, the embedding of the last SpikMamba block is pooled and projected to the action class.
  • Figure 3: Examples of event data. From the first row to the last, the event frames show 'throw' in the PAF dataset, 'type clamp back' in the HARDVS dataset, 'right arm clockwise' in the DVSGesture dataset, and 'slow walking' in the E-FAction dataset.
  • Figure 4: Attention map examples of SpikMamba. High-attention regions are marked in white, and low-attention regions are marked in black. The attention maps indicate that SpikMamba effectively focuses on image regions containing human action. Please view in colour on the screen.
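
The Figure 2 caption mentions a window-based spike linear attention layer for local temporal dependencies. The sketch below illustrates the general idea only and should not be read as the authors' layer: queries and keys are binarized with a simple threshold (an assumption standing in for a real spiking neuron with membrane dynamics), attention is computed as Q(KᵀV) so the token-by-token matrix is never formed, normalization is omitted, and tokens are split into non-overlapping windows. All function names and parameters here are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def spike(x: Matrix, threshold: float = 0.5) -> Matrix:
    """Heaviside step: 1.0 where the input reaches the threshold, else 0.0."""
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in x]


def linear_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute Q (K^T V) without forming the token-by-token matrix Q K^T."""
    d_k, d_v = len(k[0]), len(v[0])
    # kv[a][b] = sum over tokens j of k[j][a] * v[j][b]
    kv = [[sum(k[j][a] * v[j][b] for j in range(len(k))) for b in range(d_v)]
          for a in range(d_k)]
    return [[sum(q_i[a] * kv[a][b] for a in range(d_k)) for b in range(d_v)]
            for q_i in q]


def windowed_linear_attention(q: Matrix, k: Matrix, v: Matrix,
                              window: int) -> Matrix:
    """Apply linear attention independently inside non-overlapping windows."""
    out: Matrix = []
    for start in range(0, len(q), window):
        end = start + window
        out.extend(linear_attention(q[start:end], k[start:end], v[start:end]))
    return out


# Four tokens with two features each, attended in windows of two tokens.
q = spike([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6], [0.1, 0.3]])
k = spike([[0.6, 0.6], [0.9, 0.2], [0.1, 0.7], [0.8, 0.9]])
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
out = windowed_linear_attention(q, k, v, window=2)
```

Because the queries and keys are binary, every product inside the attention reduces to selecting and adding rows of `v`, which is the kind of multiplication-free computation that makes spiking layers attractive for energy efficiency. Each token only sees the other tokens in its own window, so a separate mechanism (Mamba, in the paper) is still needed for global context.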