Domain-Guided Masked Autoencoders for Unique Player Identification

Bavesh Balaji, Jerrin Bright, Sirisha Rambhatla, Yuhao Chen, Alexander Wong, John Zelek, David A Clausi

TL;DR

This work tackles robust unique player identification from broadcast sports videos by introducing domain-guided masking for masked autoencoders (d-MAE) to withstand motion blur and occlusion. A spatio-temporal network combines d-MAE-derived spatial features with a transformer decoder to model tracklet-wide temporal cues, while an enhanced KfID module and keyframe fusion augment keyframe data. The approach achieves state-of-the-art results across SoccerNet, Ice Hockey, and Baseball, with notable improvements over existing jersey-number recognition methods, and is supported by extensive ablations highlighting the contributions of masking strategy, KfID refinements, and data augmentation. Overall, the method offers a robust, data-efficient framework for jersey-number-based player identification in real-world sports analytics.

Abstract

Unique player identification is a fundamental module in vision-driven sports analytics. Identifying players from broadcast videos can aid various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatic detection of jersey numbers using deep features is challenging, primarily due to (a) motion blur, (b) low-resolution video feeds, and (c) occlusions. With their recent success in various vision tasks, masked autoencoders (MAEs) have emerged as a superior alternative to conventional feature extractors. However, most MAEs simply zero out image patches, either at random or with a focus on where to mask rather than how to mask. Motivated by human vision, we devise a novel domain-guided masking policy for MAEs, termed d-MAE, to facilitate robust feature extraction in the presence of motion blur for player identification. We further introduce a new spatio-temporal network that leverages d-MAE for unique player identification. We conduct experiments on three large-scale sports datasets: a curated baseball dataset, the SoccerNet dataset, and an in-house ice hockey dataset. We preprocess the datasets using an upgraded keyframe identification (KfID) module that focuses on frames containing jersey numbers. Additionally, we propose a keyframe-fusion technique to augment keyframes while preserving spatial and temporal context. Our spatio-temporal network achieves significant improvements, surpassing the current state of the art by 8.58%, 4.29%, and 1.20% in test-set accuracy on the three datasets, respectively. Rigorous ablations highlight the effectiveness of our domain-guided masking approach and the refined KfID module, which yield performance gains of 1.48% and 1.84%, respectively, compared to the original architectures.

Paper Structure

This paper contains 17 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example frames of various tracklets from three large-scale sports datasets, showcasing the challenges: motion blur, occlusions, and low resolution.
  • Figure 2: Comparison of d-MAE with existing MAEs. (a) Existing MAEs zero out (black out) patches at random, while (b) d-MAE introduces motion blur artifacts on random patches. The masked patches in (b) are numbered 1 to 5.
  • Figure 3: Overall architecture. Given a tracklet $\mathbb{T}$ consisting of $N$ frames, we pass $\mathbb{T}$ through the KfID module to extract $n \leq N$ keyframes that contain the jersey number. Each keyframe is passed as input to our d-MAE encoder to extract spatial features $\mathcal{F}_s$. These features are then fed to the temporal transformer decoder to extract temporal features $\mathcal{F}_{\textrm{temp}}$. Two classification heads are used to compute the predicted digits of the jersey number, $\hat{y}_1$ and $\hat{y}_2$, respectively.
  • Figure 4: Qualitative results. Performance of our model on five different player tracklets from all three datasets. We show our model's prediction for each image separately and for the entire tracklet (Pred). GT represents the ground-truth value for the entire tracklet.
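
The masking idea in Figure 2(b), corrupting randomly chosen patches with motion blur instead of zeroing them out, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the horizontal box blur standing in for motion blur, the function names `blur_patch_horizontal` and `motion_blur_mask`, and the values of `patch_size`, `mask_ratio`, and `kernel`.

```python
import random


def blur_patch_horizontal(image, top, left, size, kernel):
    """Blur one square patch in place with a horizontal box filter.

    The box filter is only a simple stand-in for motion blur (assumption).
    The filter window is clipped to the patch, so neighbours are untouched.
    """
    half = kernel // 2
    for r in range(top, top + size):
        row = image[r][left:left + size]  # copy of the original patch row
        for c in range(size):
            lo = max(0, c - half)
            hi = min(size, c + half + 1)
            window = row[lo:hi]
            image[r][left + c] = sum(window) / len(window)


def motion_blur_mask(image, patch_size=4, mask_ratio=0.5, kernel=3, seed=0):
    """Return a corrupted copy of `image` and the indices of blurred patches.

    Patches are indexed row by row on a regular grid. A fraction
    `mask_ratio` of them is picked at random and blurred rather than
    zeroed out; all values here are hypothetical defaults.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    rng = random.Random(seed)
    num_masked = int(rows * cols * mask_ratio)
    masked = sorted(rng.sample(range(rows * cols), num_masked))

    corrupted = [list(row) for row in image]  # leave the input untouched
    for idx in masked:
        top = (idx // cols) * patch_size
        left = (idx % cols) * patch_size
        blur_patch_horizontal(corrupted, top, left, patch_size, kernel)
    return corrupted, masked


# Toy 8x8 grayscale "image": a checkerboard, so any blur is visible.
image = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
corrupted, masked = motion_blur_mask(image, patch_size=4, mask_ratio=0.5)
```

In a masked-autoencoder setup, the `corrupted` image would be the encoder input and the clean `image` the reconstruction target. The blurred patches keep degraded but non-zero content, which is the contrast with the blackout masking in Figure 2(a).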