ASTRA: An Action Spotting TRAnsformer for Soccer Videos

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

TL;DR

This work tackles precise temporal action spotting in untrimmed soccer videos, where the data are long-tailed, some actions are not visible, and labels are noisy. It presents ASTRA, a DETR-inspired Transformer with multimodal inputs (visual and audio), a hierarchical encoder, and an uncertainty-aware displacement head, trained with balanced mixup and temporal augmentations. The model reaches a tight Average-mAP of 66.82 on the test set and an Average-mAP of 70.21 on the SoccerNet 2023 challenge set (ensemble), illustrating the benefits of uncertainty modeling and audio cues for robust localization. ASTRA combines temporal anchors with Gaussian-distributed displacements to improve localization under label noise.

Abstract

In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, non-visibility in certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and to produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture the label variability, and (d) input audio signal to enhance detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, achieving a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure the 3rd position with an Average-mAP of 70.21 on the challenge set.
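As a rough illustration of what an uncertainty-aware displacement objective can look like, the sketch below scores a predicted displacement with a Gaussian negative log-likelihood in which the model also predicts a variance. This is an assumption about the general technique, not the loss used in the paper: the function name `gaussian_nll`, the log-variance parameterization, and the example numbers are all hypothetical.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Hypothetical helper: predicting the log-variance keeps the variance
    positive without an explicit constraint.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (target - mean) ** 2 / var)


# The same displacement error (2.0) under two predicted uncertainty levels.
confident = gaussian_nll(target=2.0, mean=0.0, log_var=0.0)            # variance 1
uncertain = gaussian_nll(target=2.0, mean=0.0, log_var=math.log(4.0))  # variance 4

# A larger predicted variance down-weights the squared error, so a
# noisy label costs less when the model signals low confidence.
print(confident > uncertain)
```

Under this kind of objective, a model can absorb inconsistent annotations by raising its predicted variance instead of being forced to fit every noisy timestamp exactly.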

Paper Structure (16 sections, 3 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: ASTRA (Action Spotting TRAnsformer) architecture: Visual and audio embeddings from different backbones ($\varphi_v$ and $\varphi_a$) are combined and processed through a Transformer encoder-decoder ($\varGamma_e$ and $\varGamma_d$). The resulting embeddings are then utilized by a classification head ($\varLambda_s$) for temporal location classification, and a displacement head ($\varLambda_d$) to further refine predictions.
  • Figure 2: Example ground-truth action prediction with and without displacements. The utilization of only the classification head results in predictions spanning the entire range of detection of the ground-truth action (left), whereas incorporating displacements refines the predictions, aligning them closer to the actual temporal position of the ground-truth action (right).
  • Figure 3: Percentage of non-visible ground-truth actions for each action class.
  • Figure 4: Per-class results comparison of models M2, M3, and M5. The figure displays Average-AP scores for each action in M2 (bottom), the differences between M2 and M3 (middle), and the differences between M3 and M4 (top).
  • Figure 5: Analysis of M8 model: mean predicted displacement variance in temporal locations with classification probability greater than 0.5 (bottom) and difference in Average-AP between M7 and M8 (top).
  • ...and 1 more figure
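
The refinement described in the Figure 2 caption can be pictured with a small sketch: several neighbouring temporal anchors fire for the same action, and adding each anchor's predicted displacement moves them toward a common timestamp. The function `refine`, the 0.5 score threshold, and all numbers below are illustrative assumptions, not values or code from the paper.

```python
def refine(anchor_times, scores, displacements, threshold=0.5):
    """Shift each confident anchor by its predicted displacement.

    Hypothetical helper: returns (refined_time, score) pairs for anchors
    whose classification score exceeds `threshold`.
    """
    return [
        (t + d, s)
        for t, s, d in zip(anchor_times, scores, displacements)
        if s > threshold
    ]


# Five anchors (in seconds) around one action located at t = 11.0 s.
anchors = [10.0, 10.5, 11.0, 11.5, 12.0]
scores = [0.2, 0.7, 0.9, 0.8, 0.3]
displacements = [1.0, 0.5, 0.0, -0.5, -1.0]

refined = refine(anchors, scores, displacements)
print(refined)  # the three confident anchors all land on 11.0
```

Without the displacement term, the three confident anchors would report 10.5, 11.0, and 11.5 seconds, which corresponds to the spread shown on the left of Figure 2.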