Event Stream based Human Action Recognition: A High-Definition Benchmark Dataset and Algorithms

Xiao Wang, Shiao Wang, Pengpeng Shao, Bo Jiang, Lin Zhu, Yonghong Tian

TL;DR

This work tackles the limitations of RGB-based HAR by leveraging high-definition event cameras to improve robustness under challenging lighting and motion. It introduces CeleX-HAR, a large-scale dataset with $124{,}625$ sequences across $150$ actions at $1280 \times 800$, and presents a diverse evaluation protocol and extensive baselines. The authors also propose EVMamba, a VMamba-based backbone with a novel spatial-temporal scanning mechanism and voxel-temporal scanning to efficiently model long-range dependencies in event streams. Empirical results on CeleX-HAR and other benchmarks demonstrate competitive performance, and the dataset and code are released to foster future research in high-definition event-based HAR.

Abstract

Human Action Recognition (HAR) is a pivotal research domain in both computer vision and artificial intelligence, with RGB cameras dominating as the preferred tool for investigation and innovation in this field. However, in real-world applications, RGB cameras encounter numerous challenges, including poor lighting conditions, fast motion, and privacy concerns. Consequently, bio-inspired event cameras have garnered increasing attention due to advantages such as low energy consumption and high dynamic range. Nevertheless, most existing event-based HAR datasets are low-resolution ($346 \times 260$). In this paper, we propose a large-scale, high-definition ($1280 \times 800$) human action recognition dataset based on the CeleX-V event camera, termed CeleX-HAR. It encompasses 150 commonly occurring action categories, comprising a total of 124,625 video sequences. Various factors such as multi-view, illumination, action speed, and occlusion are considered when recording these data. To build a more comprehensive benchmark, we report results for over 20 mainstream HAR models as baselines for future work to compare against. In addition, we propose a novel Mamba vision backbone network for event stream based HAR, termed EVMamba, which is equipped with multi-directional scanning over the spatial plane and a novel voxel temporal scanning mechanism. By encoding and mining the spatio-temporal information of event streams, our EVMamba achieves favorable results across multiple datasets. Both the dataset and source code will be released at https://github.com/Event-AHU/CeleX-HAR.
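The two scanning ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `spatial_scans` and `temporal_scan`, the tensor layouts, and the choice of four spatial directions (row-major, column-major, and their reverses, in the style of VMamba's cross-scan) are assumptions made for illustration only. The sketch shows how a grid of tokens might be flattened into 1D sequences before being fed to a sequence model.

```python
import numpy as np

def spatial_scans(tokens):
    """Flatten an (H, W, C) token grid into four 1D sequences.

    Assumed directions (illustrative, not taken from the paper):
    row-major, column-major, and the reverse of each.
    Returns an array of shape (4, H*W, C).
    """
    h, w, c = tokens.shape
    row_major = tokens.reshape(h * w, c)
    col_major = tokens.transpose(1, 0, 2).reshape(h * w, c)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])

def temporal_scan(voxels):
    """Flatten a (T, H, W, C) voxel grid so that time varies fastest.

    Each spatial location is visited in turn and its T voxels are emitted
    consecutively (an assumed ordering for a voxel temporal scan).
    Returns an array of shape (H*W*T, C).
    """
    t, h, w, c = voxels.shape
    return voxels.transpose(1, 2, 0, 3).reshape(h * w * t, c)

if __name__ == "__main__":
    grid = np.arange(6).reshape(2, 3, 1)       # H=2, W=3, C=1
    print(spatial_scans(grid)[..., 0])         # one row per scan direction
    vox = np.arange(8).reshape(2, 2, 2, 1)     # T=2, H=2, W=2, C=1
    print(temporal_scan(vox)[:, 0])            # time-fastest ordering
```

In this sketch each spatial direction yields a different 1D ordering of the same tokens, while the temporal ordering keeps the voxels of one location adjacent, which is one plausible way to expose temporal structure to a sequence model.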

Paper Structure (21 sections, 5 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Comparison of existing event datasets: (a) ASL-DVS, (b) DVS128-Gait-Day, (c) N-Caltech101, and (d) our newly proposed CeleX-HAR dataset.
  • Figure 2: An overview of our proposed EVMamba for event-based human action recognition. Given the event streams, we design a novel spatial-temporal scanning mechanism that mines both spatial tokens and temporal voxel tokens as the input of the Mamba network. Enhanced by the temporal voxel scanning strategy, our EVMamba outperforms the baseline on multiple event-based classification benchmark datasets.
  • Figure 3: (a) The number of videos in each category of CeleX-HAR, and (b) a comparison with existing event-based recognition datasets.
  • Figure 4: More illustrations of event samples in our CeleX-HAR dataset.
  • Figure 5: The fusion method of matrix interaction.
  • ...and 3 more figures