Survey of Action Recognition, Spotting and Spatio-Temporal Localization in Soccer -- Current Trends and Research Perspectives

Karolina Seweryn; Anna Wróblewska; Szymon Łukasik

TL;DR

This survey systematically catalogs the state of soccer action understanding across three tasks: action recognition, spotting, and spatio-temporal localization. It emphasizes multimodal learning and the integration of video, audio, text, and pose/graph data, reviewing datasets (notably SoccerNet families and MultiSports) and a broad array of methods from classical feature-based approaches to transformer- and graph-based models. Key metrics such as mAP, avg-mAP, and 3D IoU are detailed for evaluating performance, alongside motion-aware extensions. The paper highlights substantial progress driven by richer representations and public datasets, while identifying ongoing challenges in data availability, annotation quality, and the development of robust soccer-specific localization benchmarks. It further argues that multimodal data and cross-domain datasets hold substantial promise for advancing real-world soccer analytics, highlighting future directions such as text-informed action understanding and more comprehensive spatio-temporal localization datasets.

Abstract

Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.

Paper Structure (45 sections, 13 equations, 9 figures, 7 tables)

This paper contains 45 sections, 13 equations, 9 figures, 7 tables.

Introduction
Motivation
Potential of Using Multimodality
Definition of Research Strategy
Problem Description
Actions
Multimodality
Metrics
Action recognition
Action spotting
Spatio-temporal action detection
3D Intersection over Union (3D IoU)
MABO (Mean Average Best Overlap)
Video-mAP@$\delta$ and Frame-mAP@$\delta$
Motion mAP and Motion AP
...and 30 more sections

Figures (9)

Figure 1: Difference between primary and secondary actions.
Figure 2: Comparison of tasks related to action analysis. Frames used in this visualization are from the SoccerNet [soccernet-v2] and MultiSports [multisports-li-yixuan] datasets.
Figure 3: Difference between how people and machines extract information. $=$ denotes the same data, while $\neq$ denotes different data. Visualization inspired by [parcalabescu2021multimodality].
Figure 4: Examples of actions from the SoccerNet-v2 dataset [soccernet-v2]. Frames come from the match between Liverpool and Swansea (2017-01-21 - 15:30).
Figure 5: Distribution of commentary languages detected by Whisper [radford2022whisper] in the SoccerNet-v2 dataset [soccernet-v2].
...and 4 more figures
