MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer

Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shotaro Tora

TL;DR

This work tackles spatio-temporal action recognition (STAR) under occlusion by leveraging multi-view RGB inputs with a transformer-based cooperation module. MVAFormer preserves spatial information by applying RoIAlign per view and fusing per-person feature maps through a transformer that is explicitly divided into Same View Attention and Different View Attention, enabling effective cross-view collaboration. Evaluations on a new four-view dataset derived from MMAct/AVA show that MVAFormer outperforms single-view and prior multi-view baselines, with notable gains in precision, recall, and F-measure. The approach highlights the importance of spatially-aware, view-symmetric fusion for dense, per-person action recognition in occluded, multi-view scenarios, and lays groundwork for scalable multi-view STAR research.

Abstract

Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of MVAFormer. For simplicity, we show an example where only one person is depicted in the multi-view video.
  • Figure 2: (Left) Vanilla transformer, which consists of Self-Attention, FFN, and LayerNorm. (Right) Our transformer, which consists of the Same View Attention (SVA), Different View Attention (DVA), FFN, and LayerNorm.
  • Figure 3: Attention mask in (Left) SVA and (Right) DVA. Each row indicates the query in the attention. Each column indicates the key and value in the attention.
  • Figure 4: Size of each action class in our dataset in descending order.
  • Figure 5: Visualization of attention weights for the red-filled circle feature from view $1$.
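
The split described in Figures 2 and 3 can be illustrated with a minimal NumPy sketch. It assumes that each view contributes the same number of tokens (for example, a flattened per-person feature map) and that the tokens of all views are concatenated view by view. The function names, the boolean mask convention (True means the query may attend to that key), and the single-head attention without learned projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def view_attention_masks(num_views, tokens_per_view):
    """Build boolean masks for Same View and Different View Attention.

    Rows index queries and columns index keys/values, as in Figure 3.
    True means the query is allowed to attend to that key.
    """
    # View index of every token, e.g. [0, 0, 1, 1] for 2 views x 2 tokens.
    view_ids = np.repeat(np.arange(num_views), tokens_per_view)
    same_view = view_ids[:, None] == view_ids[None, :]
    # SVA keeps the block-diagonal entries; DVA keeps the complement.
    return same_view, ~same_view


def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention restricted by a mask.

    Every query needs at least one allowed key, so the DVA mask
    requires at least two views.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                      # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    sva_mask, dva_mask = view_attention_masks(num_views=4, tokens_per_view=3)
    tokens = np.random.default_rng(0).normal(size=(12, 16))
    sva_out, _ = masked_attention(tokens, tokens, tokens, sva_mask)
    dva_out, _ = masked_attention(sva_out, sva_out, sva_out, dva_mask)
    print(sva_mask.astype(int))
    print(dva_out.shape)
```

In this sketch the two masks are exact complements, so applying the SVA step followed by the DVA step lets every token see every other token while keeping within-view and cross-view interactions in separate attention operations.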