Towards Active Learning for Action Spotting in Association Football Videos

Silvio Giancola; Anthony Cioppa; Julia Georgieva; Johsan Billingham; Andreas Serner; Kerry Peek; Bernard Ghanem; Marc Van Droogenbroeck

TL;DR

The paper presents an active learning framework for action spotting in football videos that reduces annotation cost and accelerates training. Using uncertainty sampling (an Uncertainty Measure and an Entropy Measure), the method selects the most informative clips for annotation and iteratively retrains action spotting models (e.g., NetVLAD++ and PTS) on the progressively enriched data. On SoccerNet-v2, NetVLAD++ reaches similar performance with only one-third of the dataset, and the approach is further validated on two additional datasets covering headers and passes. Key contributions include formalizing the first active learning workflow for action spotting, comparing sampling strategies, and introducing accelerations (adaptive scheduling, faster training, continual fine-tuning) that maintain performance. The approach aims to shrink annotation labor and speed up the deployment of action spotting systems in sports analytics.

Abstract

Association football is a complex and dynamic sport, with numerous actions occurring simultaneously in each game. Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns. Despite recent advances in computer vision, current algorithms still face significant challenges when learning from limited annotated data, lowering their performance in detecting these patterns. In this paper, we propose an active learning framework that selects the most informative video samples to be annotated next, thus drastically reducing the annotation effort and accelerating the training of action spotting models to reach the highest accuracy at a faster pace. Our approach leverages the notion of uncertainty sampling to select the most challenging video clips to train on next, hastening the learning process of the algorithm. We demonstrate that our proposed active learning framework effectively reduces the required training data for accurate action spotting in football videos. We achieve similar performances for action spotting with NetVLAD++ on SoccerNet-v2, using only one-third of the dataset, indicating significant capabilities for reducing annotation time and improving data efficiency. We further validate our approach on two new datasets that focus on temporally localizing actions of headers and passes, proving its effectiveness across different action semantics in football. We believe our active learning framework for action spotting would support further applications of action spotting algorithms and accelerate annotation campaigns in the sports domain.
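The uncertainty sampling idea mentioned above can be illustrated with a minimal sketch. The source does not give the formula of its Entropy Measure, so the code below assumes a generic version: the Shannon entropy of each frame's class distribution, averaged over the clip. The names `clip_entropy`, `rank_by_uncertainty`, and the toy `predictions` dictionary are hypothetical and only serve the illustration.

```python
import math


def clip_entropy(frame_probs):
    """Mean Shannon entropy (in nats) of the per-frame class distributions.

    Assumption: a clip's uncertainty is the average frame entropy. The
    paper's exact Entropy Measure may be defined differently.
    """
    total = 0.0
    for probs in frame_probs:
        # Skip zero probabilities, since p * log(p) tends to 0 as p -> 0.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(frame_probs)


def rank_by_uncertainty(predictions, k):
    """Return the ids of the k clips with the highest entropy score."""
    ranked = sorted(
        predictions,
        key=lambda clip_id: clip_entropy(predictions[clip_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: clip id -> list of per-frame class probabilities.
predictions = {
    "clip_a": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # confident
    "clip_b": [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],  # near uniform
    "clip_c": [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],  # in between
}

selected = rank_by_uncertainty(predictions, k=2)
print(selected)
```

Under these assumptions, the near-uniform clip is ranked first because the model is least decided about it, which is the kind of clip an uncertainty sampler would send to the annotator.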

Paper Structure (15 sections, 2 equations, 5 figures, 6 tables)

Introduction
Related work
  Sports video understanding
  Action spotting
  Active learning
Active learning for action spotting
  Model training step
  Active selection step
  Annotation step
Experiments
  Experimental setup
  Initial results
  Accelerating the active learning framework
  Generalization analyses
Conclusion

Figures (5)

Figure 1: Active learning for action spotting. Given a video clip dataset, our active learning framework iteratively (i) trains a deep learning model on the labeled clips, and (ii) selects the next video clips to be labeled by an oracle. By actively selecting the most informative video samples to annotate next, we accelerate the tagging of unlabeled datasets for training action spotting models.
Figure 2: Active learning pipeline for action spotting. We start from a small labeled dataset $\mathcal{L}$ on which we train an action spotting model whose inference function is denoted $f$. With the trained model, we select from an unlabeled dataset $\mathcal{U}$ which samples to annotate next. To do so, we first collect the model's predictions $f(\mathcal{U})$ for each clip and pass them through our selection function $g$, which ranks the clips to select $\mathbf{C}^*$. All selected clips are then passed to the oracle (a human annotator), who provides both the class and the localization of all actions within each clip. The newly annotated data are then added to the labeled dataset and used for the next training iteration. The process is repeated until the desired performance is reached or the unlabeled dataset is empty.
Figure 3: Active learning vs. random sampling. Our uncertainty sampling using the Entropy Measure (EM) converges to the optimal solution at a faster pace, using fewer data. In practice, active learning only needs 36% of the data needed by a random sampler to reach similar performances ($\mathcal{M}_{perf}^{90\%}$), and a similar amount of data could lead to up to 18% performance improvement ($\mathcal{M}_{data}^{4\%}$).
Figure 4: Faster training and adaptive active learning (AdapAL) paradigms. We show that we can significantly decrease the active learning time for our experiments without any loss in the performance of the active learning training.
Figure 5: Effect of fine-tuning a limited number of epochs. Fine-tuning from a model with a limited number of epochs leads to more stability in the training for the next active learning step.
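The train, select, annotate cycle described in the Figure 2 caption can be sketched as a generic pool-based loop. This is a toy illustration under stated assumptions, not the authors' implementation: clips are stand-in numbers in [0, 1], the "model" is a single decision threshold rather than NetVLAD++ or PTS, and `train`, `select`, `oracle`, and `active_learning_loop` are hypothetical names playing the roles of the training step, the selection function $g$, and the human annotator.

```python
def active_learning_loop(labeled, unlabeled, train, select, oracle,
                         batch_size, max_rounds):
    """Generic pool-based active learning: train, select, annotate, repeat."""
    labeled = dict(labeled)      # clip -> label (the set L)
    unlabeled = list(unlabeled)  # clips without labels (the set U)
    model = train(labeled)
    for _ in range(max_rounds):
        if not unlabeled:
            break  # nothing left to annotate
        chosen = select(model, unlabeled, batch_size)
        for clip in chosen:
            labeled[clip] = oracle(clip)  # annotation step
            unlabeled.remove(clip)
        model = train(labeled)            # retrain on the enriched set
    return model, labeled, unlabeled


# Toy stand-ins (assumptions): a clip is a float, its true label is 1 if >= 0.5.
def train(labeled):
    """Fit a threshold halfway between the closest negative and positive."""
    negatives = [x for x, y in labeled.items() if y == 0]
    positives = [x for x, y in labeled.items() if y == 1]
    return (max(negatives) + min(positives)) / 2


def select(model, unlabeled, batch_size):
    """Pick the clips closest to the decision threshold (most uncertain)."""
    return sorted(unlabeled, key=lambda x: abs(x - model))[:batch_size]


def oracle(clip):
    """Simulated annotator returning the ground-truth label."""
    return int(clip >= 0.5)


model, labeled, remaining = active_learning_loop(
    labeled={0.0: 0, 1.0: 1},
    unlabeled=[0.1, 0.3, 0.45, 0.55, 0.7, 0.9],
    train=train, select=select, oracle=oracle,
    batch_size=2, max_rounds=2,
)
print(model, sorted(labeled), remaining)
```

In this toy run the loop spends its annotation budget on the clips nearest the decision boundary and leaves the easy clips at the extremes unlabeled, which mirrors the motivation for annotating only the most informative samples.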
