FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos

Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris

TL;DR

FOOTPASS addresses the automatic reconstruction of play-by-play data in soccer by coupling broadcast-video perception with long-range tactical reasoning. It is a public benchmark comprising 54 full matches with synchronized play-by-play and game-state annotations, including player identities and single-player tracklets, which makes multi-modal learning possible. Benchmark results show that incorporating tactical priors and long-range reasoning (as in TAAD+DST) substantially improves precision and recall in high-recall regimes and increases robustness to occlusion and replays. The dataset and its accompanying baselines offer a realistic, reproducible foundation for advancing player-centric event spotting and data-driven soccer analytics.

Abstract

Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), and multi-object tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce the Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both the outputs of computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.

Paper Structure

This paper contains 45 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Samples of the dataset classes. From left to right: ball-drive, pass, cross, shot, header, throw-in, tackle, and ball-block.
  • Figure 2: Distribution of annotated events across the 8 on-ball classes. Passes and drives dominate, while decisive actions such as shots and tackles are comparatively rare.
  • Figure 3: Proportion of events per class with bounding-box annotation after interpolation and extrapolation. Coverage is highest for drives, passes, and shots, while headers, tackles, and throw-ins are more affected by occlusion and broadcast editing practices.
  • Figure 4: Distribution of events by broadcast mode (live vs. replay). Throw-ins frequently fall into replay segments because directors cut away to show the preceding sequence after the ball goes out of play, returning only after the throw-in has been executed. In contrast, headers, blocks, and shots occur almost exclusively during live play (>95% of the time).
  • Figure 5: Comparison of Average Precision (AP, $\times 10^2$) per action class across benchmarked methods, with $\delta$ = 12 frames.
  • ...and 1 more figure
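
The Figure 5 caption reports Average Precision with a temporal tolerance of $\delta$ = 12 frames. As a rough illustration of how such a tolerance can enter the scoring of a player-centric spotting task, the sketch below greedily matches predicted events to annotated ones. It is not the benchmark's evaluation code: the `Event` record, the `match_events` helper, the requirement that player identity and action class both agree, and the greedy highest-score-first matching are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical play-by-play record: who does what, and when."""
    frame: int          # frame index of the action in the video
    player_id: str      # illustrative identifier, e.g. "home_10"
    action: str         # one of the on-ball classes, e.g. "pass"
    score: float = 1.0  # model confidence (unused for ground truth)


def match_events(predictions, ground_truth, delta=12):
    """Greedy one-to-one matching within +/- delta frames.

    Assumption: a prediction is correct only if the action class and
    the player both agree with an unmatched ground-truth event.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truth)
    tp = 0
    # Process the most confident predictions first.
    for pred in sorted(predictions, key=lambda e: -e.score):
        candidates = [
            g for g in unmatched
            if g.action == pred.action
            and g.player_id == pred.player_id
            and abs(g.frame - pred.frame) <= delta
        ]
        if candidates:
            # Consume the temporally closest ground-truth event.
            best = min(candidates, key=lambda g: abs(g.frame - pred.frame))
            unmatched.remove(best)
            tp += 1
    fp = len(predictions) - tp
    fn = len(unmatched)
    return tp, fp, fn


if __name__ == "__main__":
    gt = [Event(100, "home_10", "pass"), Event(200, "away_7", "shot")]
    preds = [
        Event(105, "home_10", "pass", 0.9),  # within 12 frames: match
        Event(230, "away_7", "shot", 0.8),   # 30 frames off: miss
        Event(108, "home_10", "pass", 0.5),  # duplicate of a matched event
    ]
    tp, fp, fn = match_events(preds, gt, delta=12)
    print(f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```

Sweeping a confidence threshold over the predictions and recomputing precision and recall at each step would give the points of a precision-recall curve, from which an AP value can be derived.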