FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos
Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris
TL;DR
FOOTPASS addresses the automatic reconstruction of play-by-play data in soccer by coupling broadcast-video perception with long-range tactical reasoning. The authors present it as a public benchmark comprising 54 full matches with synchronized play-by-play and game-state annotations, including player identities and single-player tracklets, enabling multi-modal learning. Benchmark results show that incorporating tactical priors and long-range reasoning (as in TAAD+DST) substantially improves precision and recall in high-recall regimes and enhances robustness to occlusion and replays. The dataset and accompanying baselines offer a realistic, reproducible foundation for advancing player-centric event spotting and data-driven soccer analytics.
Abstract
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multi-object tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce the Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
