T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

TL;DR

T-DEED tackles Precise Event Spotting (PES) in sports videos with a temporal-discriminability-enhancing encoder–decoder that processes multiple temporal scales. It combines SGP-based modules, which increase token discriminability, with an encoder–decoder that, via the novel SGP-Mixer, restores high temporal resolution and integrates information across scales, enabling precise frame-level event localization. The approach yields state-of-the-art results on FigureSkating and FineDiving, with notable gains on tight-tolerance evaluation metrics, and ablations support the contribution of each component. This work advances PES by prioritizing token discriminability and multi-scale temporal integration, with practical implications for sports analytics and broadcast applications.

Abstract

In this paper, we introduce T-DEED, a Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in sports videos. T-DEED addresses multiple challenges in the task, including the need for discriminability among frame representations, high output temporal resolution to maintain prediction precision, and the necessity to capture information at different temporal scales to handle events with varying dynamics. It tackles these challenges through its specifically designed architecture, featuring an encoder-decoder for leveraging multiple temporal scales and achieving high output temporal resolution, along with temporal modules designed to increase token discriminability. Leveraging these characteristics, T-DEED achieves SOTA performance on the FigureSkating and FineDiving datasets. Code is available at https://github.com/arturxe2/T-DEED.

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of the Precise Event Spotting task on the FigureSkating dataset. Red-marked frames contain events that require precise localization and correct classification among possible classes.
  • Figure 2: Illustration of T-DEED architecture comprising three key components: (1) Feature extractor to produce per-frame representations, (2) Temporally discriminant encoder-decoder to capture local and global temporal information while enhancing token discriminability, and (3) Prediction head to generate per-frame classifications and displacements for refinement.
  • Figure 3: Illustration of the SGP layer structure, with its module comprising an instant-level branch to boost token discriminability and a window-level branch for temporal modeling.
  • Figure 4: Illustration of the SGP-Mixer layer structure integrating an SGP-Mixer module to aggregate features of different temporal scales. This module follows the SGP principles, incorporating instant-level and window-level branches to boost token discriminability while merging the features.
  • Figure 5: Temporal module discriminability analysis. Cosine similarity after backbone (BB), post-positional encoding (PE), and at each temporal layer is displayed. Additionally, mAP performance with $\delta=1$ is reported.
  • ...and 2 more figures
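
The discriminability analysis described for Figure 5 relies on cosine similarity between token (per-frame) representations: the more similar the tokens of a clip are to one another, the harder they are to tell apart. The following is a minimal sketch of one way such a score could be computed, assuming it is the mean cosine similarity over all pairs of distinct tokens in a clip. The function names and the averaging scheme are illustrative assumptions and are not taken from the T-DEED code.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine_similarity(tokens):
    """Average cosine similarity over all pairs of distinct tokens.

    tokens: sequence of T vectors, one per frame (hypothetical layout).
    A lower value means the tokens are easier to tell apart.
    """
    pairs = list(combinations(tokens, 2))
    if not pairs:
        raise ValueError("need at least two tokens")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy inputs: identical tokens versus mutually orthogonal tokens.
collapsed = [[1.0, 2.0, 3.0] for _ in range(4)]
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

collapsed_score = mean_pairwise_cosine_similarity(collapsed)
distinct_score = mean_pairwise_cosine_similarity(distinct)
print(collapsed_score, distinct_score)
```

In this toy case the identical tokens score close to 1 and the orthogonal tokens score 0, so a drop in the score between layers would indicate that tokens have become more discriminable under this assumed measure.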