Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition

Sumin Lee, Yooseung Wang, Sangmin Woo, Changick Kim

TL;DR

Through extensive experiments, this work validates the effectiveness of spatio-temporal proximity among individuals and of the dual-path architecture in PAR, and achieves new state-of-the-art performance with an overall F1 score of 46.5% on the JRDB-PAR dataset.

Abstract

Panoramic Activity Recognition (PAR) seeks to identify diverse human activities across different scales, from individual actions to social group and global activities in crowded panoramic scenes. PAR presents two major challenges: 1) recognizing the nuanced interactions among numerous individuals and 2) understanding multi-granular human activities. To address these, we propose the Social Proximity-aware Dual-Path Network (SPDP-Net), based on two key design principles. First, while previous works often focus on the spatial distance among individuals within an image, we argue for considering spatio-temporal proximity, which is crucial for individual relation encoding to correctly understand social dynamics. Second, deviating from existing hierarchical approaches (individual-to-social-to-global activity), we introduce a dual-path architecture for multi-granular activity recognition. This architecture comprises individual-to-global and individual-to-social paths that mutually reinforce each other's task with global-local context through multiple layers. Through extensive experiments, we validate the effectiveness of spatio-temporal proximity among individuals and of the dual-path architecture in PAR. Furthermore, SPDP-Net achieves new state-of-the-art performance with an overall F1 score of 46.5% on the JRDB-PAR dataset.

Paper Structure (35 sections, 4 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: Importance of the spatio-temporal proximity for understanding social group dynamics. To distinguish between social groups, it is crucial to leverage positional relationships among individuals not just in space but also over time. Consider an initial scene where individuals marked with red, yellow, and green bounding boxes are close to each other, giving the impression that they belong to the same social group. However, as time goes on, it becomes evident that only the individuals in the red and yellow boxes move together, indicating shared social group membership, while the person in the green box does not.
  • Figure 1: A detailed overview of (a) parallel, (b) hierarchical, and (c) reverse hierarchical architectures.
  • Figure 2: Overview of the proposed SPDP-Net. SPDP-Net consists of two stages: 1) proximity-based relation encoding and 2) multi-granular activity recognition. $T_0$ indicates the center frame of a given video.
  • Figure 2: Visualization of the ground-truth (GT) and predicted relation matrix $R$, the proximity relation matrix $R_p$, and the similarity matrix $R_s$. Best viewed zoomed in on screen.
  • Figure 3: Detailed architecture of two stages in SPDP-Net. (a) Proximity-based relation encoding and (b) multi-granular activity recognition (i.e., DPATr).
  • ...and 8 more figures
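
The scenario in the first Figure 1 caption (three people who start close together, only two of whom keep moving together) can be illustrated with a toy calculation. The sketch below is not SPDP-Net's relation encoding: it assumes each person is reduced to a 2D box center per frame, uses hypothetical trajectories, and uses a plain time-averaged Euclidean distance as a stand-in for spatio-temporal proximity. It only shows why a single frame can be ambiguous while the full clip is not.

```python
import numpy as np


def spatial_distance(tracks, frame):
    """Pairwise Euclidean distances at one frame. tracks has shape (T, N, 2)."""
    points = tracks[frame]
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def spatio_temporal_distance(tracks):
    """Pairwise Euclidean distances averaged over all T frames (an assumed proxy)."""
    diff = tracks[:, :, None, :] - tracks[:, None, :, :]
    return np.linalg.norm(diff, axis=-1).mean(axis=0)


# Hypothetical box centers for three people over T = 5 frames.
T = 5
t = np.arange(T, dtype=float)
red = np.stack([t, np.zeros(T)], axis=1)              # walks right
yellow = np.stack([t, np.ones(T)], axis=1)            # walks right, beside red
green = np.stack([-2.0 * t, np.full(T, 2.0)], axis=1) # walks left, away from both
tracks = np.stack([red, yellow, green], axis=1)       # shape (T, N, 2)

d_first = spatial_distance(tracks, frame=0)
d_clip = spatio_temporal_distance(tracks)

# Index 0 = red, 1 = yellow, 2 = green.
print("frame 0:    red-yellow", d_first[0, 1], " yellow-green", d_first[1, 2])
print("whole clip: red-yellow", d_clip[0, 1], " yellow-green", d_clip[1, 2])
```

In the first frame, yellow is equally far from red and from green, so spatial distance alone cannot say which pair forms a group. Averaged over the clip, the red-yellow distance stays constant while the yellow-green distance grows, which matches the intuition in the caption.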