Action Selection Learning for Multi-label Multi-view Action Recognition

Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide

TL;DR

This paper proposes a method named Multi-view Action Selection Learning (MultiASL), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints, and includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos.

Abstract

Multi-label multi-view action recognition aims to recognize multiple concurrent or sequential actions from untrimmed videos captured by multiple cameras. Existing work has focused on multi-view action recognition in a narrow area with strong labels available, where the onset and offset of each action are labeled at the frame level. This study focuses on real-world scenarios where cameras are distributed to cover a wide area and only weak, video-level labels are available. We propose a method named Multi-view Action Selection Learning (MultiASL), which leverages action selection learning to enhance view fusion by selecting the most useful information from different viewpoints. The proposed method includes a Multi-view Spatial-Temporal Transformer video encoder to extract spatial and temporal features from multi-viewpoint videos. Action Selection Learning is employed at the frame level, using pseudo ground-truth obtained from the weak video-level labels, to identify the most relevant frames for action recognition. Experiments in a real-world office environment using the MM-Office dataset show that the proposed method outperforms existing methods. The source code is available at https://github.com/thanhhff/MultiASL/.
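To make the distinction between strong and weak labels concrete, the following is a minimal sketch of how frame-level annotations collapse into a video-level label vector. The `StrongLabel` tuple layout, the function name `to_weak_labels`, and the example values are assumptions made for illustration; they are not taken from the paper or the MM-Office dataset format.

```python
from typing import List, Tuple

# Hypothetical strong-label format: (class_index, onset_frame, offset_frame).
StrongLabel = Tuple[int, int, int]


def to_weak_labels(strong: List[StrongLabel], num_classes: int) -> List[int]:
    """Collapse frame-level annotations into a video-level multi-hot vector."""
    weak = [0] * num_classes
    for class_index, _onset, _offset in strong:
        # Timing information is discarded; only the presence of the class remains.
        weak[class_index] = 1
    return weak


# Made-up example: class 2 occurs twice, class 3 once, classes 0 and 1 never.
strong_labels = [(2, 10, 45), (3, 30, 80), (2, 90, 120)]
weak_labels = to_weak_labels(strong_labels, num_classes=4)
print(weak_labels)  # [0, 0, 1, 1]
```

In the weakly labeled setting described above, only a vector like `weak_labels` would be available for training; the onset and offset frames would be unknown.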

Paper Structure (22 sections, 7 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Configuration of multi-camera settings. (a) Multiple cameras arranged to surround a target in a narrow area. (b) Multiple distributed cameras covering a wide-range area, which is the environment targeted in this study.
  • Figure 2: Overview of the proposed MultiASL model. It takes videos from $N$ different views as input and predicts multi-label actions. (a) Each video is processed by a Shared Spatial Encoder to extract spatial features, which are then fed into a Shared Temporal Transformer Encoder to capture temporal dependencies and generate temporal features. Finally, view-level features are aggregated for action recognition. (b) Frame-level spatial and temporal features are fused to select actions based on video-level labels.
  • Figure 3: Logit matrix $\mathbf{X}$ for $M$ video samples and $C$ classes, as used by the Two-way multi-label loss (Kobayashi, 2023).
  • Figure 4: Example of generating pseudo ground-truth for the actionness loss. The video consists of 9 frames and 4 video-level classes, with classes 3 and 4 being the ground-truth classes. For each class, the top-3 frame-level predictions are selected (in orange). The frame-level positive labels are obtained by taking the logical OR of the selected frames across the ground-truth video-level classes (green hatched texture).
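The pseudo ground-truth generation described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pseudo_actionness_labels`, the `frame_scores[t][c]` layout, and all score values are hypothetical, and classes are 0-indexed here, so the caption's classes 3 and 4 correspond to indices 2 and 3.

```python
from typing import List, Sequence


def pseudo_actionness_labels(
    frame_scores: Sequence[Sequence[float]],
    video_labels: Sequence[int],
    k: int,
) -> List[int]:
    """Return one 0/1 pseudo label per frame.

    frame_scores[t][c]: score of class c at frame t (hypothetical layout).
    video_labels[c]: 1 if class c is present in the video, else 0.
    """
    num_frames = len(frame_scores)
    selected = [0] * num_frames
    for c, present in enumerate(video_labels):
        if not present:
            continue  # only ground-truth classes contribute
        # Frame indices sorted by descending score for class c.
        ranked = sorted(
            range(num_frames), key=lambda t: frame_scores[t][c], reverse=True
        )
        for t in ranked[:k]:
            selected[t] = 1  # logical OR across ground-truth classes
    return selected


# Made-up scores for 9 frames and 4 classes.
scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.1],
    [0.2, 0.1, 0.8, 0.3],
    [0.1, 0.3, 0.7, 0.9],
    [0.3, 0.2, 0.2, 0.8],
    [0.8, 0.9, 0.3, 0.1],
    [0.2, 0.1, 0.1, 0.4],
    [0.1, 0.2, 0.4, 0.7],
    [0.7, 0.1, 0.5, 0.2],
]
labels = pseudo_actionness_labels(scores, video_labels=[0, 0, 1, 1], k=3)
print(labels)  # [0, 1, 1, 1, 1, 0, 0, 1, 0]
```

In this example, class index 2 selects frames 1, 2, and 3, and class index 3 selects frames 3, 4, and 7. Frame 3 is chosen by both classes but counted once, and the high scores of the absent classes at frames 0 and 5 are ignored.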