Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning

Ryuki Tezuka, Chihiro Nakatani, Norimichi Ukita

Abstract

This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.

Paper Structure

This paper contains 51 sections, 6 equations, 14 figures, 19 tables.

Figures (14)

  • Figure 1: Our self-supervised GAF learning augmented by two pretext tasks: (1) person flow estimation for local dynamics embedding into GAFs and (2) group-relevant object localization for global context embedding into GAFs. Compared with previous self-supervised methods [DBLP:conf/eccv/HRN, DBLP:conf/cvpr/GAFL] that utilize only local appearance features, our pretext tasks enhance GAF learning.
  • Figure 2: Overview of our network. (a) Image feature extractor. Group-relevant objects are inpainted to enhance global feature learning in (c). (b) GAF learning network. Image features are fed into the transformer encoder, MLP, and temporal pooling to obtain a GAF. (c) Pretext tasks for GAF learning. The flow of each person and the locations of the group-relevant objects are estimated from the GAF.
  • Figure 3: Inpainting to enhance global feature embedding into a GAF by localizing group-relevant objects (i.e., ball in our method). Top: w/o inpainting. Bottom: w/ inpainting (our method).
  • Figure 4: Pretext tasks: overview of person flow estimation, (a) and (b), and group-relevant object location estimation, (c) and (d). $\bm{L}^{p,t}$ and $\bm{T}^{t}$ are omitted for simplicity.
  • Figure 5: Visual comparison of group activity retrieval on VBD. (a) R-set query. (b) R-spike query.
  • ...and 9 more figures