SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

Xiao Wang; Yao Rong; Zongzhen Wu; Lin Zhu; Bo Jiang; Jin Tang; Yonghong Tian

TL;DR

The paper addresses RGB-Event pattern recognition by fusing RGB frames with raw event streams to overcome limitations of single-modality approaches. It proposes SSTFormer, a four-module framework combining a memory-equipped RGB encoder (MST), an energy-efficient SCNN for raw events, a multi-modal bottleneck fusion (MBF), and a prediction head, with an optional dual-Transformer configuration that pairs SpikingFormer with MST. A large-scale PokerEvent dataset (114 classes; 27102 frame-event pairs) is introduced to support RGB-Event evaluation. Experimental results on PokerEvent and HARDVS demonstrate competitive accuracy and favorable energy characteristics, with ablations validating the contributions of MST, SCNN, and MBF. The work advances practical RGB-Event pattern recognition and provides datasets and code to facilitate further research and development.

Abstract

Event camera-based pattern recognition is an emerging research topic. Current approaches usually transform event streams into images, graphs, or voxels and apply deep neural networks for event-based classification. Although they perform well on simple event recognition datasets, their results may still be limited by two issues. First, they use only spatially sparse event streams for recognition, which may fail to capture color and detailed texture information. Second, they adopt either Spiking Neural Networks (SNN), which are energy-efficient but yield suboptimal results, or Artificial Neural Networks (ANN), which perform well but are energy-intensive; few of them seek a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously, and we introduce a new RGB frame-event recognition framework to address these issues. The proposed method contains four main modules: a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Because RGB-Event classification datasets are scarce, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27102 frame-event pairs recorded with a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets validate the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at https://github.com/Event-AHU/SSTFormer
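
To make the SNN side of this trade-off more concrete, here is a minimal sketch of the two spiking neuron types that the paper compares with an ANN perceptron in Figure 3: integrate-and-fire (IF) and leaky integrate-and-fire (LIF). The function name `simulate_neuron`, the hard reset to zero, the threshold of 1.0, and the leak factor are illustrative assumptions, not the paper's SCNN implementation.

```python
import numpy as np


def simulate_neuron(inputs, threshold=1.0, leak=1.0):
    """Simulate one spiking neuron over discrete time steps.

    leak=1.0 models an integrate-and-fire (IF) neuron.
    0 < leak < 1 models a leaky integrate-and-fire (LIF) neuron whose
    membrane potential decays between steps.
    All parameter names and values are hypothetical, for illustration only.
    """
    v = 0.0  # membrane potential
    spikes = []
    for x in inputs:
        v = leak * v + x  # decay the old potential, then integrate the input
        if v >= threshold:
            spikes.append(1)  # emit a binary spike
            v = 0.0  # hard reset after firing (an assumed reset rule)
        else:
            spikes.append(0)
    return np.array(spikes)


inputs = np.full(8, 0.4)  # constant input current over 8 time steps
if_spikes = simulate_neuron(inputs, leak=1.0)   # IF neuron
lif_spikes = simulate_neuron(inputs, leak=0.5)  # LIF neuron
print(if_spikes, lif_spikes)
```

With this constant input the IF neuron fires on every third step, while the LIF neuron's potential saturates below the threshold and it stays silent. The output is binary and sparse, which is the property usually credited for the energy efficiency of SNNs.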

Paper Structure (18 sections, 7 equations, 13 figures, 5 tables)

Introduction
Related Work
Event-based Classification
Spiking Neural Networks
Transformer Networks
Methodology
Overview
Input Representation
Network Architecture
Dual-Transformer based Version
Experiments
Dataset and Evaluation Metric
Implementation Details
Comparison with Other SOTA Models
Ablation Study
...and 3 more sections

Figures (13)

Figure 1: Illustration of video classification by fusing RGB frames and event stream. The event images (bottom row) are stacked for visualization.
Figure 2: An overview of our proposed multi-modal fusion framework for RGB-Event based pattern recognition. We propose a novel SCNN (Spiking Convolutional Neural Network), a hybrid SNN-ANN network that directly encodes the raw event streams instead of pre-processing them into an intermediate representation. It achieves a better trade-off between energy consumption and overall recognition results. For the RGB input, we design a novel memory support Transformer (MST) network to learn the spatial-temporal information. We first divide the video frames into multiple clips. Within each clip, we treat the last frame as the query and the rest as support frames, and conduct support-query interactive learning using the cross-attention mechanism. The outputs of the SCNN and MST modules are fused with bottleneck feature maps, which are then fed into the prediction head for recognition.
Figure 3: Comparison between the widely used ANN perceptron, LIF, and IF spiking neuron.
Figure 4: An illustration of the SpikingFormer block (Zhou et al., 2023).
Figure 5: Distribution of the number of RGB-Event samples of each category in our newly proposed PokerEvent dataset.
...and 8 more figures
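
The Figure 2 caption describes how the MST handles RGB input: the video is divided into clips, the last frame of each clip acts as the query, and the remaining frames act as support in a cross-attention step. The following is a rough NumPy sketch of that idea under simplifying assumptions: each frame is reduced to a single feature vector, there are no learned query/key/value projections, and leftover frames that do not fill a clip are dropped. The helper names `split_into_clips` and `support_query_attention` are hypothetical and do not come from the paper's code.

```python
import numpy as np


def split_into_clips(frames, clip_len):
    """Split (T, D) per-frame features into (n_clips, clip_len, D).

    Frames that do not fill a whole clip are dropped (an assumption).
    """
    n_clips = frames.shape[0] // clip_len
    return frames[: n_clips * clip_len].reshape(n_clips, clip_len, -1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def support_query_attention(clip):
    """Scaled dot-product cross-attention inside one clip.

    The last frame is the query; earlier frames serve as keys and values.
    Returns the attended feature (D,) and the attention weights (L-1,).
    """
    support, query = clip[:-1], clip[-1:]  # shapes (L-1, D) and (1, D)
    scores = query @ support.T / np.sqrt(clip.shape[-1])  # (1, L-1)
    weights = softmax(scores, axis=-1)
    attended = weights @ support  # (1, D)
    return attended[0], weights[0]


rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 16))  # 10 frames, 16-dim features (toy data)
clips = split_into_clips(frames, clip_len=4)  # 2 clips; 2 frames dropped
attended, weights = support_query_attention(clips[0])
print(clips.shape, attended.shape, weights.shape)
```

In a full model each frame would contribute many spatial tokens and the query, keys, and values would pass through learned linear layers; the sketch only shows which frames play which role in the attention step.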
