MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide

TL;DR

This paper tackles real-world action recognition in wide-area, multi-view environments by introducing the MultiSensor-Home dataset with frame-level annotations and a Transformer-based fusion method, MultiTSF. MultiTSF jointly processes synchronized audio and high-resolution visual data from multiple views, uses a Human Detection Module to emphasize human-centric frames, and employs temporal Transformers to model cross-view and temporal dependencies. Empirical results on MultiSensor-Home and MM-Office show that MultiTSF surpasses state-of-the-art methods in both macro- and micro-averaged metrics, with qualitative analyses illustrating interpretable attention patterns across views and time. The work advances practical, robust multi-modal action recognition for smart-home and surveillance applications and provides publicly available code for reproducibility.

Abstract

Multi-modal multi-view action recognition is a rapidly growing field in computer vision, offering significant potential for applications in surveillance. However, current datasets often fail to address real-world challenges such as wide-area distributed settings, asynchronous data streams, and the lack of frame-level annotations. Furthermore, existing methods face difficulties in effectively modeling inter-view relationships and enhancing spatial feature learning. In this paper, we introduce the MultiSensor-Home dataset, a novel benchmark designed for comprehensive action recognition in home environments, and also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed MultiSensor-Home dataset features untrimmed videos captured by distributed sensors, providing high-resolution RGB and audio data along with detailed multi-view frame-level action labels. The proposed MultiTSF method leverages a Transformer-based fusion mechanism to dynamically model inter-view relationships. Furthermore, the proposed method integrates a human detection module to enhance spatial feature learning, guiding the model to prioritize frames with human activity and thereby improve action recognition accuracy. Experiments on the proposed MultiSensor-Home and the existing MM-Office datasets demonstrate the superiority of MultiTSF over the state-of-the-art methods. Quantitative and qualitative results highlight the effectiveness of the proposed method in advancing real-world multi-modal multi-view action recognition. The source code is available at https://github.com/thanhhff/MultiTSF.

Paper Structure

This paper contains 33 sections, 14 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Configuration of multi-view settings. (a) Multiple sensors capturing the same area. (b) Multiple sensors capturing different areas, which is the environment targeted in this study.
  • Figure 2: Room layout illustrating the placement of multi-view sensors in the proposed MultiSensor-Home dataset.
  • Figure 3: Example from the proposed MultiSensor-Home dataset showcasing actions captured from multiple views.
  • Figure 4: Overview of the proposed MultiTSF method. It consists of: (1) Multi-modal Feature Extraction using Shared Audio Encoder and Shared Visual Encoder to extract discriminative features, (2) Human Detection Module to detect human presence and generate pseudo-ground-truth labels, and (3) Temporal Modeling and Transformer-based Fusion to capture temporal dependencies and integrate spatiotemporal features for action recognition.
  • Figure 5: Visualization of multi-view and temporal attention heatmaps from the Shared Visual Encoder on the MultiSensor-Home dataset.
  • ...and 1 more figures
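
The Figure 4 caption describes a Transformer-based fusion step that relates features from different views. As a rough illustration of that idea only, the following sketch applies single-head scaled dot-product self-attention across per-view feature vectors for one time step and then averages the result. It is not the authors' implementation: the function names `softmax` and `fuse_views`, the toy feature values, the absence of learned query/key/value projections, and the mean pooling are all assumptions made for this example, and the temporal modeling and Human Detection Module are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features):
    """Self-attention across views, followed by mean pooling.

    view_features: list of V feature vectors (one per view), each of length D.
    Returns (fused_vector, attention_matrix).

    Assumption: queries, keys, and values are the raw features themselves.
    A trained Transformer layer would use learned projections instead.
    """
    d = len(view_features[0])
    scale = math.sqrt(d)
    attention = []  # attention[i][j]: weight view i places on view j
    attended = []   # per-view features after attending to all views
    for q in view_features:
        scores = [sum(a * b for a, b in zip(q, k)) / scale
                  for k in view_features]
        weights = softmax(scores)
        attention.append(weights)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, view_features))
            for j in range(d)
        ])
    n_views = len(view_features)
    fused = [sum(row[j] for row in attended) / n_views for j in range(d)]
    return fused, attention


# Toy input: views 0 and 1 see similar content, view 2 sees something else.
features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
fused, attention = fuse_views(features)
for i, row in enumerate(attention):
    print(f"view {i} attends:", [round(w, 3) for w in row])
```

In this toy input, view 0 places more weight on the similar view 1 than on the unrelated view 2, which is the kind of inter-view relationship an attention-based fusion layer can express.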