ASTRA: An Action Spotting TRAnsformer for Soccer Videos
Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés
TL;DR
This work tackles precise temporal action spotting in untrimmed soccer videos, addressing three difficulties: a long-tail class distribution, non-visible actions, and label noise. It presents ASTRA, a DETR-inspired Transformer with multimodal inputs (visual and audio), a hierarchical encoder, and an uncertainty-aware displacement head, trained with balanced mixup and temporal augmentations. The model reaches a tight Average-mAP of 66.82 on the test set and an Average-mAP of 70.21 on the SoccerNet 2023 challenge set (ensemble), illustrating the benefits of uncertainty modeling and audio cues for robust localization. Overall, ASTRA advances end-to-end action spotting by combining temporal anchors with Gaussian-distributed displacements to improve localization under label noise.
Abstract
In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset: the requirement for precise action localization, a long-tail data distribution, the non-visibility of certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and to produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture the label variability, and (d) an input audio signal to enhance the detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, which achieves a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure the 3rd position with an Average-mAP of 70.21 on the challenge set.
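To illustrate the idea behind component (c), the following is a minimal sketch of how a displacement modeled as a Gaussian can be scored with a negative log-likelihood. It is not the loss used in ASTRA, whose exact formulation is not given here. It assumes a head that outputs a mean and a log-variance for the temporal displacement from an anchor; the function name `gaussian_nll` and all numeric values are hypothetical.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Assumption: the head predicts a log-variance rather than a variance,
    a common choice that keeps the variance positive.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (target - mean) ** 2 / var)


# Hypothetical case: the annotated action lies 0.4 s after its temporal anchor.
target = 0.4

# Accurate prediction with low predicted variance.
confident_right = gaussian_nll(target, mean=0.4, log_var=math.log(0.01))
# Prediction off by 1 s, still with low predicted variance.
confident_wrong = gaussian_nll(target, mean=1.4, log_var=math.log(0.01))
# Same 1 s error, but with a larger predicted variance.
uncertain_wrong = gaussian_nll(target, mean=1.4, log_var=math.log(1.0))

print(confident_right, uncertain_wrong, confident_wrong)
```

Under this sketch, a large error costs far less when the predicted variance is also large, while an accurate prediction is rewarded for being confident. This is the sense in which such a head can absorb variability in the annotated timestamps.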
