Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition
Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Jinyoung Park, Yooseung Wang, Donguk Kim, Changick Kim
TL;DR
This work tackles weakly-supervised group activity recognition (WSGAR) by introducing Flaming-Net, a flow-guided motion-learning network. It combines a motion-aware actor encoder with a dual-path relation module that separately models long-range actor dynamics (the actor-centric path) and frame-wise group interactions (the group-centric path). During training, optical flow guides the encoder through a contrastive flow loss, complemented by a temporal-consistency objective and an auxiliary per-frame classifier loss; inference relies only on RGB data. Flaming-Net achieves state-of-the-art performance on the NBA and Volleyball WSGAR benchmarks, with gains in mean per-class accuracy and overall activity recognition accuracy, especially in scenarios with long-range, complex inter-actor interactions. The approach offers a practical, detector-free training paradigm, and its attention visualizations and ablation studies support the importance of motion-aware actor representations and dual-path relational reasoning.
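For readers unfamiliar with the metric, mean per-class accuracy averages the accuracy of each class separately, so rare activities weigh as much as frequent ones. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name and the toy labels `"a"` and `"b"` are hypothetical.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average the per-class accuracy (recall) over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Each class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, made-up labels: class "a" is frequent, class "b" is rare.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mpca = mean_per_class_accuracy(y_true, y_pred)
print(overall, mpca)  # 0.75 0.5
```

In this toy case, always predicting the frequent class yields 75% overall accuracy but only 50% mean per-class accuracy, which is why the metric is informative on class-imbalanced data.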
Abstract
Weakly-Supervised Group Activity Recognition (WSGAR) aims to recognize the activity performed jointly by a group of individuals using only video-level labels, without actor-level annotations. We propose the Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of a motion-aware actor encoder that extracts actor features and a two-pathway relation module that infers the interactions among actors and their group activity. Flaming-Net leverages an additional optical flow modality during training to enhance its motion awareness when locating locally active actors. The first pathway of the relation module, the actor-centric path, first captures the temporal dynamics of each individual actor and then constructs inter-actor relationships. In parallel, the group-centric path first builds spatial connections between actors within the same frame and then captures the joint spatio-temporal dynamics among them. We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including an MPCA score on the NBA dataset that is 2.8 percentage points higher. Importantly, the optical flow modality is used only for training and not for inference.
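The difference between the two pathways is mainly the order in which temporal and spatial reasoning are applied. The sketch below illustrates that ordering with NumPy under several assumptions that are not taken from the paper: actor features are given as a `(T, K, D)` array, relations are modeled with parameter-free self-attention, features are mean-pooled, and the two paths are fused by summation. All function names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def self_attention(x):
    """Parameter-free scaled dot-product self-attention over the second-to-last axis."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def actor_centric_path(actors):
    """Temporal dynamics per actor first, then inter-actor relations."""
    per_actor = np.swapaxes(actors, 0, 1)   # (K, T, D)
    temporal = self_attention(per_actor)    # attend across frames, per actor
    summary = temporal.mean(axis=1)         # (K, D), assumed mean pooling
    related = self_attention(summary)       # attend across actors
    return related.mean(axis=0)             # (D,)


def group_centric_path(actors):
    """Spatial relations within each frame first, then joint spatio-temporal relations."""
    spatial = self_attention(actors)        # (T, K, D), attend across actors per frame
    t, k, d = spatial.shape
    tokens = spatial.reshape(t * k, d)      # all actors in all frames as one token set
    joint = self_attention(tokens)          # attend across space and time together
    return joint.mean(axis=0)               # (D,)


rng = np.random.default_rng(0)
actors = rng.standard_normal((4, 3, 8))     # T=4 frames, K=3 actors, D=8 channels

# Assumed fusion: simple sum of the two path outputs.
group_feature = actor_centric_path(actors) + group_centric_path(actors)
print(group_feature.shape)  # (8,)
```

A real model would use learned projections and a classifier on top of the fused feature; the sketch only makes the temporal-then-spatial versus spatial-then-spatio-temporal ordering concrete.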
