Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition

Lan Chen; Dong Li; Xiao Wang; Pengpeng Shao; Wei Zhang; Yaowei Wang; Yonghong Tian; Jin Tang

TL;DR

This paper targets robust event-stream recognition, starting from the observation that relying on a single representation (images or voxels) limits feature expressiveness. It introduces EFV++, a dual-stream framework that processes event frames with a Transformer and event voxels with a GNN. The two streams are combined through a quality-aware Retain-Blend-Exchange fusion and a bottleneck Transformer, followed by a GRU-based hybrid readout. The approach achieves state-of-the-art results on multiple benchmarks, including a new best of 90.51% top-1 on Bullying10k, and demonstrates strong cross-dataset generalization. The work offers a scalable, multi-view fusion paradigm that leverages both spatial-temporal and 3D stereo information, with practical implications for real-time event-based recognition and potential for hardware-friendly distillation in future work.
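To make the two input views concrete, the following is a minimal sketch of how an event stream could be turned into an event frame and a voxel grid. It is not taken from the paper: the `(x, y, t, p)` event layout, the function names, and the `t_bins` and `cell` parameters are all assumptions chosen for illustration, and the released code may build its representations differently.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate polarity-signed event counts into a 2D frame.

    events: (N, 4) array with assumed columns x, y, t, p (hypothetical layout).
    """
    frame = np.zeros((height, width))
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    sign = np.where(events[:, 3] > 0, 1.0, -1.0)
    # np.add.at handles repeated pixel indices correctly (unbuffered add).
    np.add.at(frame, (y, x), sign)
    return frame

def events_to_voxels(events, height, width, t_bins, cell):
    """Count events per (time bin, y cell, x cell) voxel.

    t_bins and cell are illustrative parameters, not values from the paper.
    """
    t = events[:, 2]
    span = max(t.max() - t.min(), 1e-9)  # avoid division by zero
    ti = np.minimum(((t - t.min()) / span * t_bins).astype(int), t_bins - 1)
    yi = events[:, 1].astype(int) // cell
    xi = events[:, 0].astype(int) // cell
    grid = np.zeros((t_bins, -(-height // cell), -(-width // cell)))
    np.add.at(grid, (ti, yi, xi), 1.0)
    return grid
```

In a dual-stream setup of this kind, the frame would feed the image branch, while the non-empty voxels would become the nodes of the graph handled by the GNN branch.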

Abstract

Existing event stream-based pattern recognition models usually represent the event stream as a point cloud, voxel, image, etc., and design various deep neural networks to learn their features. Although considerable results can be achieved in simple cases, model performance may be limited by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this paper, we propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be learned separately by utilizing a Transformer and a Graph Neural Network (GNN). We believe the features of each representation contain both effective and redundant features, and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are fed into the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on multiple widely used event stream-based classification datasets. Specifically, we achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$. The source code of this paper has been released at \url{https://github.com/Event-AHU/EFV_event_classification/tree/EFVpp}.
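The retain, blend, and exchange idea described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it assumes the two streams provide aligned `(N, D)` token features, that a per-token quality score in `[0, 1]` is already available, that blending is a plain average, and that `low` and `high` are illustrative thresholds. The abstract does not specify how quality is measured or how each operation is defined.

```python
import numpy as np

def rbe_fuse(feat_a, feat_b, score_a, score_b, low=0.3, high=0.7):
    """Hypothetical retain/blend/exchange over two aligned token sets.

    feat_a, feat_b: (N, D) features from the two streams (assumed aligned).
    score_a, score_b: (N,) quality scores in [0, 1] (assumed given).
    low, high: illustrative thresholds separating the three quality levels.
    """
    def enhance(own, other, score):
        out = own.copy()                          # retain: score >= high is kept
        mid = (score >= low) & (score < high)
        out[mid] = 0.5 * (own[mid] + other[mid])  # blend: average with other stream
        bad = score < low
        out[bad] = other[bad]                     # exchange: take the other stream's token
        return out

    return enhance(feat_a, feat_b, score_a), enhance(feat_b, feat_a, score_b)
```

For example, with `feat_a` all zeros, `feat_b` all ones, and `score_a = [0.9, 0.5, 0.1]`, the first token of stream A is retained, the second becomes 0.5, and the third is replaced by the stream B token.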

Paper Structure (20 sections, 3 equations, 7 figures, 6 tables)

Introduction
Related Works
Event Stream-based Recognition
Multi-View/-Modality based Recognition
Transformer Network
Our Proposed Approach
Overview
Input Representation
Backbone Networks
Quality-aware RBE Module
Hybrid Interaction Readout Mechanism
Classification Head & Loss Function
Experiments
Datasets and Evaluation Metrics
Implementation Details
...and 5 more sections

Figures (7)

Figure 1: Comparison of frame- and event stream-based cameras (https://youtu.be/6xOmo7Ikwzk). (a, b) show representative samples in regular scenarios, low illumination (L.I.), and fast motion (F.M.). (c, d) illustrate the different types of raw data representation of frame- and event stream-based cameras.
Figure 2: An illustration of our proposed EFV++ framework, which takes diverse event stream representations as the input, i.e., the event frame and voxel. We adopt the Transformer and graph neural networks to handle the event frames and event voxels respectively, and fuse the dual stream in a differentiated manner. To be specific, we keep high-quality features, remove redundant and ineffective features, and integrate general features. Then, we combine the output features from the two branches with bottleneck features, arrange them, and use a GRU network for fusion to obtain a more diverse feature representation. Finally, we feed this feature into the classification head for effective pattern recognition.
Figure 3: Visualization of the similarity matrix learned by ST-Transformer in the RBE module. The $2^{nd}$ and $5^{th}$ columns are raw attention matrices, and the $3^{rd}$ and $6^{th}$ columns are superimposed images of the input image and the attention map after resizing.
Figure 4: Analysis of key thresholds for recognition on N-Caltech101 dataset.
Figure 5: Parameter analysis of core modules in our framework.
...and 2 more figures
