Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset

Shiao Wang, Xiao Wang, Bo Jiang, Lin Zhu, Guoqi Li, Yaowei Wang, Yonghong Tian, Jin Tang

TL;DR

This work tackles robust human activity recognition (HAR) under challenging conditions by fusing RGB and event camera data. It introduces HARDVS 2.0, a large-scale multi-modal HAR benchmark with 107,646 paired videos spanning 300 real-world actions, and MMHCO-HAR, a physics-inspired heat conduction backbone that fuses RGB and event features via modality-specific Frequency Value Embeddings (FVEs) and a DCT-IDCT-based diffusion mechanism. A policy routing fusion mechanism adaptively selects among Modal Complementary, Modal Discriminative, and Modal Specific fusion paths, enabling robust multi-modal integration with improved efficiency. Experimental results on HARDVS 2.0 and PokerEvent demonstrate competitive accuracy gains (e.g., 53.2% top-1 on HARDVS 2.0) and favorable efficiency, highlighting the practicality of physics-informed multi-modal HAR for real-world, low-light, and fast-motion scenarios, while providing a valuable dataset for future research.

Abstract

Human Activity Recognition (HAR) has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. However, challenging factors in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. Biologically inspired event cameras offer a promising way to overcome these limitations. In this work, we rethink human activity recognition by combining RGB and event cameras. The first contribution is a large-scale multi-modal RGB-Event human activity recognition benchmark dataset, termed HARDVS 2.0, which bridges the dataset gap. It contains 300 categories of everyday real-world actions with a total of 107,646 paired videos covering various challenging scenarios. Inspired by the physics-informed heat conduction model, we propose a novel multi-modal heat conduction operation framework for effective activity recognition, termed MMHCO-HAR. In more detail, given the RGB frames and event streams, we first extract feature embeddings using a stem network. Then, multi-modal Heat Conduction blocks are designed to fuse the dual features; their key module is the multi-modal Heat Conduction Operation layer. We integrate RGB and event embeddings through a multi-modal DCT-IDCT layer while adaptively incorporating the thermal conductivity coefficient into this module via Frequency Value Embeddings (FVEs). After that, we propose an adaptive fusion module based on a policy routing strategy for high-performance classification. Comprehensive experiments demonstrate that our method consistently performs well, validating its effectiveness and robustness. The source code and benchmark dataset will be released at https://github.com/Event-AHU/HARDVS/tree/HARDVSv2
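
To make the DCT-IDCT idea in the abstract concrete, the following is a minimal NumPy sketch of a generic heat conduction step on a single 2D feature map: transform to the frequency domain with a DCT, damp each frequency by an exponential factor controlled by a thermal conductivity k, and transform back with an IDCT. It is an illustration under stated assumptions, not the MMHCO-HAR implementation. The function names (dct_matrix, heat_conduction, fuse_average), the fixed time step t, and the simple averaging of the two modalities are all hypothetical; in the paper the conductivity is incorporated adaptively via FVEs and fusion happens inside the multi-modal layer, neither of which is reproduced here.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size n x n, so that C @ C.T == I."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # spatial index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # rescale the DC row for orthonormality
    return c


def heat_conduction(x, k, t=1.0):
    """Diffuse a 2D map x: IDCT(DCT(x) * exp(-k * (wx^2 + wy^2) * t)).

    k is the thermal conductivity: a scalar or an array broadcastable to
    x.shape (assumption: in a learned model it would be predicted per
    frequency rather than passed in by hand).
    """
    h, w = x.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ x @ cw.T                    # 2D DCT
    wy = (np.pi * np.arange(h) / h)[:, None]    # vertical frequencies
    wx = (np.pi * np.arange(w) / w)[None, :]    # horizontal frequencies
    decay = np.exp(-k * (wx ** 2 + wy ** 2) * t)
    return ch.T @ (spectrum * decay) @ cw       # 2D IDCT


def fuse_average(rgb, event, k_rgb, k_event):
    """Hypothetical fusion: diffuse each modality with its own k, then average."""
    return 0.5 * (heat_conduction(rgb, k_rgb) + heat_conduction(event, k_event))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb_feat = rng.standard_normal((8, 6))
    event_feat = rng.standard_normal((8, 6))
    fused = fuse_average(rgb_feat, event_feat, k_rgb=0.3, k_event=0.8)
    print(fused.shape)
```

Two properties of this sketch are easy to check: with k = 0 the map is returned unchanged, and with k > 0 the mean of the map is preserved while higher frequencies are attenuated, which is the smoothing behavior expected of heat diffusion.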

Paper Structure

This paper contains 27 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: (a) Comparison between existing datasets and our proposed HARDVS 2.0 dataset for video classification. (b) A simple schematic diagram of our framework.
  • Figure 2: An overview of our proposed MMHCO-HAR framework for RGB-Event based human activity recognition (HAR). A multi-modal visual heat conduction model is introduced to effectively integrate data from both RGB and event modalities to achieve robust HAR. Specifically, we propose modality-specific continuous Frequency Value Embeddings to capture the unique characteristics of each modality and enhance information interaction between multi-modal heat conduction blocks. Additionally, we introduce a policy-routing-based fusion method to adaptively fuse multi-modal information, ensuring optimized performance across diverse scenarios.
  • Figure 3: The adaptive fusion module based on a policy routing strategy, which selects among Modal Complementary Fusion (MCF), Modal Discriminative Fusion (MDF), and Modal Specific Fusion (MSF).
  • Figure 4: Illustration of representative video clips in our HARDVS 2.0 dataset.
  • Figure 5: (a) The impact of the number of frames on accuracy; (b) The impact of input image resolution on accuracy.
  • ...and 4 more figures