Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition

Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Jinyoung Park, Yooseung Wang, Donguk Kim, Changick Kim

TL;DR

This work tackles weakly-supervised group activity recognition by introducing Flaming-Net, a flow-guided motion-learning network. It combines a motion-aware actor encoder with a dual-path relation module that separately models long-range actor dynamics (actor-centric) and frame-wise group interactions (group-centric). Optical flow guides the encoder during training through a contrastive flow loss, complemented by a temporal-consistency objective and an auxiliary per-frame classifier loss, while inference relies only on RGB data. Flaming-Net achieves state-of-the-art performance on NBA and Volleyball WSGAR benchmarks, demonstrating strong gains in mean per-class accuracy and overall activity recognition, especially in long-range, complex inter-actor scenarios. The approach offers a practical, detector-free training paradigm with interpretable attention visualizations and ablation-supported design choices that validate the importance of motion-aware actor representations and dual-path relational reasoning.

Abstract

Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand the activity performed together by a group of individuals using only video-level labels, without actor-level labels. We propose the Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of a motion-aware actor encoder that extracts actor features and a two-pathway relation module that infers the interactions among actors and their activity. Flaming-Net leverages an additional optical flow modality in the training stage to enhance its motion awareness when finding locally active actors. The first pathway of the relation module, the actor-centric path, first captures the temporal dynamics of individual actors and then constructs inter-actor relationships. In parallel, the group-centric path starts by building spatial connections between actors within the same timeframe and then captures simultaneous spatio-temporal dynamics among them. We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including a 2.8%p higher MPCA score on the NBA dataset. Importantly, we use the optical flow modality only for training and not for inference.
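
The abstract reports results in MPCA (mean per-class accuracy). As a reminder of how that metric differs from overall accuracy, here is a minimal sketch in plain Python. The function name `mean_per_class_accuracy` and the toy labels in the usage note are illustrative assumptions, not taken from the paper or its code.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies (recall), so rare classes count as much as frequent ones."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    correct = defaultdict(int)  # correct predictions per ground-truth class
    total = defaultdict(int)    # number of samples per ground-truth class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)

    # Only classes that appear in y_true contribute to the mean.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, with ground truth `["A", "A", "A", "A", "B"]` and predictions that are all `"A"`, overall accuracy is 0.8, but the function returns 0.5 because class `"B"` is never recognized. This is why MPCA is informative on class-imbalanced data.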

Paper Structure (29 sections, 6 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Importance of motion-awareness. In WSGAR, it is crucial to capture the movements of key actors. In the second frame, DFWSGAR Kim2022-dfwsgar fails to recognize the layup action, as the player doing the layup is not strongly highlighted. In contrast, Flaming-Net strongly highlights the player performing the layup, thanks to its motion awareness. Flaming-Net uses optical flow as learning guidance, motivated by the observation that an area with intense optical flow indicates the presence of key actors, as seen in the second frame. Lastly, in the third frame, our method successfully identifies multiple key actors, such as the celebrating scoring team, the defending team looking down, and the referee, as strong cues to correctly predict a successful layup attempt.
  • Figure 2: Visualization of the motion-aware actor encoder attention maps on the activities (i) 3p-success and (ii) 2p-layup-fail.-def. Flaming-Net highlights the key actors, e.g., the 3-point shooter and the referee giving a hand sign in sequence (i), and the offensive player doing a layup and the counter-attacking defensive players in (ii).
  • Figure 3: Overall architecture of Flaming-Net. For each frame, a 2D CNN backbone generates a feature map and the motion-aware actor encoder extracts actor features that represent important actors or entities. The sequence of actor features is then forwarded to two different paths: actor motion path and group motion path. To improve the learning process, we add local motion learning with a flow-map-based label to help the model learn local motion. Additionally, we include a temporal consistency loss and a frame-level classifier. Note that optical flow is only used in the training stage.
  • Figure 4: The motion-aware actor encoder transforms actor queries and the feature map using a series of multi-head attention modules to generate actor features. The flow learning loss guides the encoders using the optical flow map to highlight active key actors. The temporal consistency loss $\ell_{\text{tco}}$ encourages each token to represent the same actor across frames. The loss attracts temporally adjacent actor features belonging to the same index, while repelling those with different indices. Here, we assume $L=2$.
  • Figure 4: Ablation on the number of $K_{\text{flw}}$.
  • ...and 5 more figures
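
The caption of the motion-aware actor encoder figure describes the temporal consistency loss $\ell_{\text{tco}}$ as attracting temporally adjacent actor features with the same index and repelling those with different indices. The exact formulation is not given in this excerpt, so the following is only one plausible reading: an InfoNCE-style contrastive loss over cosine similarities. The function names, the temperature `tau`, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def temporal_consistency_loss(feats, tau=0.1):
    """Hypothetical InfoNCE-style sketch; not the paper's confirmed formula.

    feats[t][k] is the feature vector of actor token k in frame t.
    For each token k in frame t, the same-index token in frame t + 1 is the
    positive, and the other tokens in frame t + 1 are negatives.
    """
    total, count = 0.0, 0
    for t in range(len(feats) - 1):
        current, following = feats[t], feats[t + 1]
        for k, anchor in enumerate(current):
            logits = [cosine(anchor, other) / tau for other in following]
            # Numerically stable log-sum-exp over all tokens of the next frame.
            peak = max(logits)
            log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
            # Negative log-probability of picking the same-index token.
            total += log_denominator - logits[k]
            count += 1
    return total / max(count, 1)
```

Under this sketch, a sequence in which each token index keeps pointing at the same feature direction yields a loss near zero, while swapping token indices between adjacent frames yields a large loss, which matches the attract/repel behavior the caption describes.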