
Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba

Hongwei Ren, Yue Zhou, Jiadong Zhu, Haotian Fu, Yulong Huang, Xiaopeng Lin, Yuetong Fang, Fei Ma, Hao Yu, Bojun Cheng

TL;DR

EventMamba treats event camera data as a temporally ordered 3D pseudo-Point Cloud and processes it with a three-stage hierarchy (LocalFE, Temporal Aggregation, GlobalFE) built on a fast State Space Model (SSM) backbone, Mamba. By preserving explicit temporal information and using attention-based temporal aggregation, it achieves state-of-the-art results among point-based methods on action recognition datasets of six different scales, and it outperforms frame-based methods on Camera Pose Relocalization (CPR) and eye-tracking regression while consuming minimal computational resources. The approach combines lightweight preprocessing, a hardware-friendly Mamba block, and task-specific losses, which makes it a candidate for edge deployment. Overall, EventMamba shows that sparse, temporally enriched point-cloud processing is viable for diverse event-based perception tasks and can be both more efficient and more accurate than traditional frame-based pipelines.

Abstract

Event cameras draw inspiration from biological systems, offering low latency and high dynamic range while consuming minimal power. The prevailing approach to processing an Event Cloud converts it into frame-based representations, which neglects the sparsity of events, loses fine-grained temporal information, and increases the computational burden. In contrast, Point Cloud is a popular representation for processing 3-dimensional data and serves as an alternative that exploits local and global spatial features. Nevertheless, previous point-based methods perform unsatisfactorily compared to frame-based methods on spatio-temporal event streams. To bridge this gap, we propose EventMamba, an efficient and effective framework based on the Point Cloud representation that rethinks the distinction between Event Cloud and Point Cloud and emphasizes vital temporal information. The Event Cloud is fed into a hierarchical structure with staged modules that process both implicit and explicit temporal features. Specifically, we redesign the global extractor to enhance explicit temporal extraction over a long sequence of events, using temporal aggregation and the State Space Model (SSM) based Mamba. Our model consumes minimal computational resources in the experiments and still achieves SOTA point-based performance on action recognition datasets of six different scales. It even outperforms all frame-based methods on both the Camera Pose Relocalization (CPR) and eye-tracking regression tasks. Our code is available at: https://github.com/rhwxmx/EventMamba.
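
As a rough illustration of the point-based view described in the abstract, the sketch below turns a window of events into a temporally ordered pseudo-point cloud. It is not the authors' preprocessing: the function name `events_to_point_cloud`, the `(x, y, timestamp, polarity)` event layout, the evenly spaced subsampling, and the normalization to the unit range are all assumptions made for illustration.

```python
from typing import List, Tuple

Event = Tuple[int, int, float, int]  # assumed layout: (x, y, timestamp, polarity)
Point = Tuple[float, float, float]   # normalized (x, y, t)


def events_to_point_cloud(events: List[Event], width: int, height: int,
                          num_points: int) -> List[Point]:
    """Hypothetical helper: build a temporally ordered pseudo-point cloud."""
    if not events or num_points <= 0:
        return []
    # Sort by timestamp so the temporal order is kept explicitly.
    ordered = sorted(events, key=lambda e: e[2])
    t0, t1 = ordered[0][2], ordered[-1][2]
    span = (t1 - t0) or 1.0  # guard against a window with a single timestamp
    # Evenly spaced subsampling (an assumed choice) that preserves the order.
    n = len(ordered)
    k = min(num_points, n)
    indices = [i * n // k for i in range(k)]
    # Treat the normalized timestamp as the third coordinate of each point.
    return [(ordered[i][0] / width,
             ordered[i][1] / height,
             (ordered[i][2] - t0) / span) for i in indices]


if __name__ == "__main__":
    window = [(10, 20, 3.0, 1), (0, 0, 1.0, 0), (5, 5, 2.0, 1), (40, 30, 5.0, 0)]
    for point in events_to_point_cloud(window, width=100, height=50, num_points=2):
        print(point)
```

Because the output stays sorted by time, a sequence model can consume the points in order, and no event is merged with another as it would be when accumulating into a frame.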

Paper Structure

This paper contains 39 sections, 18 equations, 5 figures, 13 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Gains on three datasets. EventMamba achieves SOTA among point-based methods while maintaining very high efficiency. Marker shapes indicate the data format, and colors denote Floating Point Operations (FLOPs).
  • Figure 2: Two distinct approaches to handling Event Clouds: $(a)$ frame-based methods and $(b)$ point-based methods. Frame-based methods condense a temporal span of events into a grayscale image using techniques such as SAE, LIF, and others [miao2019neuromorphic].
  • Figure 3: EventMamba handles both tasks by passing the Event Cloud through a sequence of modules: downsampling, a hierarchical structure, and a classifier or regressor. More specifically, $LocalFE$ extracts local geometric features, while $GlobalFE$ raises the dimensionality of the extracted features and abstracts higher-level global and explicit temporal features.
  • Figure 4: Spatial and temporal aggregation. The different colors represent the features of events at different times, with gray representing the features in $[t_5,t_{k-1}]$.
  • Figure 5: Visualization of the event-based action recognition, CPR, and eye-tracking datasets.
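
To make the frame-based path in Figure 2 concrete, here is a minimal sketch of a Surface of Active Events (SAE) style frame, in which each pixel keeps only the timestamp of its most recent event, rescaled to a grayscale value. The function name `sae_frame`, the event layout, and the mapping to the range 1 to 255 (with 0 reserved for pixels that saw no event) are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Tuple

Event = Tuple[int, int, float, int]  # assumed layout: (x, y, timestamp, polarity)


def sae_frame(events: List[Event], width: int, height: int) -> List[List[int]]:
    """Hypothetical helper: condense events into one grayscale SAE-style frame."""
    frame = [[0] * width for _ in range(height)]  # 0 marks "no event seen"
    if not events:
        return frame
    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = (t1 - t0) or 1.0  # guard against a window with a single timestamp
    for x, y, t, _polarity in sorted(events, key=lambda e: e[2]):
        # A later event at the same pixel overwrites the earlier one,
        # so only the most recent timestamp per pixel survives.
        frame[y][x] = 1 + round(254 * (t - t0) / span)
    return frame


if __name__ == "__main__":
    window = [(0, 0, 1.0, 1), (1, 0, 2.0, 0), (0, 0, 3.0, 1)]
    print(sae_frame(window, width=2, height=1))
```

In the usage example, two events fall on pixel (0, 0) but the frame retains only the later one, which is the loss of fine-grained temporal information that the abstract attributes to frame-based representations.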