ReSpike: Residual Frames-based Hybrid Spiking Neural Networks for Efficient Action Recognition

Shiting Xiao, Yuhang Li, Youngeun Kim, Donghyun Lee, Priyadarshini Panda

TL;DR

ReSpike addresses the challenge of energy-efficient action recognition by integrating Spiking Neural Networks (SNNs) with Artificial Neural Networks (ANNs) through a novel Key-Residual input representation. The method assigns spatial learning to the ANN via RGB Key Frames and temporal dynamics to the SNN via Residual Frames, then fuses them with a multi-scale cross-attention mechanism. It achieves state-of-the-art or competitive accuracy on HMDB-51, UCF-101, and Kinetics-400, including a first direct-SNN result on Kinetics-400 with 70.1% accuracy, while delivering favorable energy-accuracy trade-offs (up to ~6.8x energy reduction over 3D CNN baselines). The approach is trained end-to-end with Spatio-Temporal Back-Propagation and surrogate gradients, and ablation studies with attention-map visualizations corroborate the effectiveness of key-residual representations and cross-modal fusion for dynamic scene understanding.
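The surrogate-gradient training mentioned above can be pictured with a single leaky integrate-and-fire (LIF) neuron. The sketch below is illustrative only: the leak factor, threshold, hard reset, and triangular surrogate are assumptions chosen for clarity, not values or choices taken from the paper, and the function names `lif_forward` and `surrogate_grad` are hypothetical.

```python
import numpy as np

def lif_forward(inputs, leak=0.5, threshold=1.0):
    """Run one LIF neuron over a 1-D sequence of input currents.

    Assumed dynamics (not from the paper): the membrane potential decays
    by `leak` each step, accumulates the input, emits a spike when it
    reaches `threshold`, and is then hard-reset to zero.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = leak * membrane + current
        spike = 1.0 if membrane >= threshold else 0.0
        spikes.append(spike)
        if spike:
            membrane = 0.0  # hard reset after firing
    return np.array(spikes)

def surrogate_grad(membrane, threshold=1.0, width=1.0):
    """Triangular stand-in for the derivative of the spike step function.

    The true derivative is zero almost everywhere, so back-propagation
    substitutes a smooth bump centred on the threshold (assumed shape).
    """
    distance = np.abs(np.asarray(membrane, dtype=float) - threshold)
    return np.maximum(0.0, 1.0 - distance / width) / width
```

In a training loop, the forward pass would use the hard spike while the backward pass would multiply incoming gradients by `surrogate_grad` evaluated at the pre-reset membrane potential, unrolled over time steps.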

Abstract

Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks because of the challenges of processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that combines the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing video clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike uses an ANN to learn spatial information and an SNN to learn temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., >30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves performance comparable to prior ANN approaches while offering a better accuracy-energy trade-off.
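The Key-Residual decomposition described in the abstract can be sketched in a few lines. This is a rough illustration under stated assumptions: Key Frames are taken as every `stride`-th frame and Residual Frames as differences between consecutive frames. The abstract does not give the exact formula, so both choices, along with the function name `key_residual_split`, are hypothetical.

```python
import numpy as np

def key_residual_split(video, stride):
    """Split a clip into sparse RGB key frames and dense residual frames.

    video  : array of shape (T, H, W, C)
    stride : keep one key frame every `stride` frames (assumed sampling rule)

    Returns (key_frames, residual_frames), where residual_frames[t] is
    video[t + 1] - video[t] (assumed definition of a residual frame).
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    video = np.asarray(video, dtype=np.float32)
    key_frames = video[::stride]               # spatial content for the ANN branch
    residual_frames = video[1:] - video[:-1]   # motion-like signal for the SNN branch
    return key_frames, residual_frames
```

For static regions the residual is zero, which is what makes Residual Frames sparse and event-like; a larger `stride` means fewer Key Frames for the ANN branch to process.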

Paper Structure (15 sections, 6 equations, 5 figures, 4 tables)

This paper contains 15 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Accuracy and energy consumption for ANN and SNN models on (a) ImageNet, (b) HMDB-51, and (c) UCF-101 datasets. While SOTA SNN models demonstrate comparable accuracy and reduced cost on static ImageNet, they still face a significant performance shortfall compared to ANNs on dynamic action datasets.
  • Figure 2: ReSpike architecture. We use RGB key frames and event-like residual frames as inputs to our ReSpike model, which consists of an SNN branch, an ANN branch, and four cross-attention modules. The ANN branch output $\mathbf{F}_A$ and the SNN branch output $\mathbf{F}_S$ are combined in the classification head for the final prediction.
  • Figure 3: The cross-attention fusion module is built upon a standard Transformer block with an additional cross-attention layer, where the Query map is associated with SNN features and the Key and Value maps are associated with ANN features. Colors indicating the network modules and features are consistent with those in Fig. 2.
  • Figure 4: Attention maps. This example belongs to the action class "Apply Eye Makeup" from UCF-101. For the red query token associated with the residual frames, we show the top-5 attended key tokens associated with the key frame. ReSpike attends to regions such as the hand, eye, and makeup brush, which are essential for classification.
  • Figure 5: Influence of the stride $s$.
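
The fusion described in the Figure 3 caption (Queries from SNN features, Keys and Values from ANN features) and the top-5 lookup in Figure 4 can be sketched as single-head scaled dot-product attention. This is a minimal sketch, not the paper's module: the projection matrices, the single head, and the names `cross_attention` and `top_k_keys` are assumptions, and the surrounding Transformer block (normalization, feed-forward layer, residual connections) is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(snn_tokens, ann_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with SNN queries and ANN keys/values.

    snn_tokens : (N_s, d) tokens from the residual-frame (SNN) branch
    ann_tokens : (N_a, d) tokens from the key-frame (ANN) branch
    w_q, w_k, w_v : (d, d) hypothetical projection matrices
    """
    q = snn_tokens @ w_q
    k = ann_tokens @ w_k
    v = ann_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)   # each query row sums to 1 over ANN tokens
    return attn @ v, attn

def top_k_keys(attn, query_index, k=5):
    """Indices of the k most-attended ANN tokens for one SNN query token."""
    return np.argsort(attn[query_index])[::-1][:k]
```

Calling `top_k_keys(attn, i, 5)` for a chosen query token `i` gives the kind of top-5 key-token selection that Figure 4 visualizes, mapped back to spatial positions in the key frame.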