Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
TL;DR
This work benchmarks pixels versus positions for Group Activity Recognition (GAR) by introducing SoccerNet-GAR, a synchronized multimodal dataset built from 64 World Cup matches with 94,285 labeled events across 10 classes. It provides an apples-to-apples evaluation protocol for the video and tracking modalities, and presents a novel role-aware graph neural network for tracking-based GAR alongside strong video baselines. The tracking approach, which uses positional edges and temporal attention, achieves $67.2\%$ balanced accuracy, outperforming the best video baseline ($58.1\%$) by $9.1$ percentage points while using about $438\times$ fewer parameters ($197$K vs. $86.3$M) and training $4.25\times$ faster. The results suggest that the spatio-temporal relations captured by tracking data carry a stronger signal for coordinated team actions, though per-class analyses reveal complementary strengths, motivating future multimodal fusion.
Abstract
Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet other modalities, such as agent positions and trajectories over time (i.e., tracking), remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether the pixel (video) or position (tracking) modality leads to better group activity recognition is therefore important for guiding further research on the topic. However, no standardized benchmark currently aligns broadcast video and tracking data for the same group activities, which prevents an apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the 2022 football World Cup. Specifically, the broadcast video and player tracking modalities for $94{,}285$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) competitive video-based classifiers and (ii) tracking-based classifiers leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves $67.2\%$ balanced accuracy compared to $58.1\%$ for the best video baseline, while training $4.25\times$ faster with $438\times$ fewer parameters ($197$K vs. $86.3$M). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
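The headline metric above is balanced accuracy. As a minimal sketch, assuming the common definition of balanced accuracy as the mean of per-class recall (the abstract does not spell out the formula), the snippet below shows how it differs from plain accuracy on an imbalanced toy example, and re-derives the parameter ratio and accuracy gap from the reported figures. The class names `"pass"` and `"shot"` and the toy labels are hypothetical and are not taken from SoccerNet-GAR.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true.

    Assumed definition; the abstract does not state the exact formula.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not from the dataset).
y_true = ["pass"] * 8 + ["shot"] * 2
y_pred = ["pass"] * 8 + ["pass", "shot"]

# Plain accuracy rewards the majority class; balanced accuracy
# weights every class equally regardless of its frequency.
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

# Sanity checks on the figures reported in the abstract.
param_ratio = 86.3e6 / 197e3       # video vs. tracking parameter counts
gap_points = round(67.2 - 58.1, 1)  # balanced-accuracy gap in points

print(f"plain accuracy:    {plain_acc:.2f}")
print(f"balanced accuracy: {bal_acc:.2f}")
print(f"parameter ratio:   {param_ratio:.1f}x")
print(f"accuracy gap:      {gap_points} points")
```

Under this definition, a classifier that neglects rare classes is penalized even when its plain accuracy is high, which makes the metric a sensible choice when event classes are unevenly represented.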
