Learning Group Activity Features Through Person Attribute Prediction

Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita

TL;DR

This work introduces Group Activity Feature (GAF) learning to capture complex multi-person activity as a compact latent vector, learned via per-person attribute prediction without requiring explicit group activity labels. The method uses a Masked Person Modeling (MPM) stage and a transformer-based GAF network to produce a latent G that, together with location guidance, enables accurate per-person attribute prediction for two variants: GAFL-PAC (action classes) and GAFL-PAF (appearance features). Experiments on Volleyball and the Collective Activity Dataset demonstrate superior retrieval and group activity recognition performance, and visualization shows GAFs capturing fine-grained group contexts. The approach serves as a strong pretraining signal for downstream GAR and other group-centric tasks, highlighting the practical impact of annotation-free GAF learning.

Abstract

This paper proposes Group Activity Feature (GAF) learning, in which features of multi-person activity are learned as a compact latent vector. Unlike prior work, which requires manual annotation of group activities for supervised learning, our method learns the GAF through person attribute prediction without group activity annotations. Because the whole network is trained end-to-end so that the GAF is required for predicting the attributes of the people in a group, the GAF is trained to represent the features of multi-person activity. As person attributes, we propose to use a person's action class and appearance features: the former is easy to annotate because of its simplicity, and the latter requires no manual annotation. In addition, we introduce location-guided attribute prediction to disentangle the complex GAF so that the features of each target person are extracted properly. Experimental results validate that our method outperforms SOTA methods quantitatively and qualitatively on two public datasets. Visualization of our GAF also demonstrates that our method learns a GAF representing fine-grained group activity classes. Code: https://github.com/chihina/GAFL-CVPR2024.
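The idea of location-guided attribute prediction can be illustrated with a toy sketch: a single GAF vector is shared by the whole group, and each person's attribute is predicted from that vector together with the person's own location feature. Everything below is an assumption made for illustration, not the authors' implementation: the names `predict_attribute`, `GAF_DIM`, `LOC_DIM`, and `NUM_CLASSES`, the random linear weights, and the plain concatenation followed by a softmax all stand in for the transformer-based network described in the paper.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_attribute(gaf, location, weights, bias):
    """Score each attribute class from the concatenation [gaf ; location].

    Hypothetical stand-in for the paper's prediction network: a single
    linear layer followed by a softmax.
    """
    x = list(gaf) + list(location)
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


random.seed(0)
GAF_DIM, LOC_DIM, NUM_CLASSES = 4, 2, 3  # toy sizes, not the paper's

# Untrained random weights, used only to make the sketch runnable.
weights = [
    [random.uniform(-1.0, 1.0) for _ in range(GAF_DIM + LOC_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

gaf = [0.2, -0.5, 0.9, 0.1]           # one vector shared by the whole group
locations = [[0.1, 0.8], [0.7, 0.3]]  # hypothetical normalized (x, y) per person

# The same GAF is queried once per person; only the location differs.
predictions = [predict_attribute(gaf, loc, weights, bias) for loc in locations]
```

Because the location is the only per-person input at prediction time, any information needed to tell the people apart beyond their positions has to be carried by the shared GAF, which is the pressure that makes the GAF encode the group context.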

Paper Structure (52 sections, 5 equations, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Difference between annotations for GAR and our group activity feature learning. (a) Supervised GAR employs group activity annotations that are difficult due to various similar group activities. (b-1) Our GAF learning employs person action annotations that are easy due to their simplicity. (b-2) We further propose annotation-free GAF learning with person appearance features.
  • Figure 2: Example of our group activity feature learning. In this example, a group activity feature is learned to extract the scene context (i.e., spike group activity) through prediction of a person attribute (e.g., digging). See Fig. 3 for the detailed architecture.
  • Figure 3: Overview of our GAF learning network. (a) Person feature extractor. The person feature is composed of appearance and location features. (b) GAF learning network. The GAF is learned from extracted people features. (c) Location-guided attribute prediction network with the GAF. The attribute of each person is predicted from the location feature of the person and the GAF extracted in (b).
  • Figure 4: Confusion matrices of GAR by nearest neighbor retrieval on VBD and CAD in GAFL-PAF. Each row and column show the ground-truth and recognized group activity, respectively. Results of the other methods are shown in the supplementary material.
  • Figure 5: Visualization of the learned GAF on VBD and CAD in GAFL-PAF. The color of each point shows the ground-truth group activity label of the corresponding test sample.
  • ...and 12 more figures
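
The nearest neighbor retrieval mentioned in the Figure 4 caption can be sketched as follows: a query GAF is compared against a gallery of GAFs with known group activity labels, and the label of the closest gallery entry is taken as the recognized activity. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the use of cosine similarity as the closeness measure, and the three-dimensional toy vectors and labels are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_label(query, gallery):
    """Return the label of the gallery GAF most similar to the query.

    gallery is a list of (gaf_vector, label) pairs. Cosine similarity is
    an assumption here; another distance could be substituted.
    """
    best_vector, best_label = max(
        gallery, key=lambda item: cosine_similarity(query, item[0])
    )
    return best_label


# Toy gallery: made-up GAF vectors with made-up activity labels.
gallery = [
    ([1.0, 0.1, 0.0], "spike"),
    ([0.0, 1.0, 0.2], "set"),
    ([0.1, 0.0, 1.0], "pass"),
]

query = [0.9, 0.2, 0.1]
predicted = retrieve_label(query, gallery)
print(predicted)
```

Comparing `predicted` with the ground-truth label for every test sample and counting the outcomes per class pair would yield a confusion matrix of the kind shown in Figure 4.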