Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance

Sanchayan Santra, Vishal Chudasama, Pankaj Wasnik, Vineeth N. Balasubramanian

TL;DR

This work tackles Precise Event Spotting (PES) in long, untrimmed sports videos, addressing two core challenges: long-range temporal dependencies and severe class imbalance. It introduces an end-to-end network that combines a CNN-based spatio-temporal feature extractor (RegNetY) with an Adaptive Spatio-Temporal Refinement Module (ASTRM) and a Bi-GRU temporal block, trained with the Soft Instance Contrastive (SoftIC) loss and optimized with Adaptive Sharpness-Aware Minimization (ASAM). The key contributions are the ASTRM, which enriches features with local and global temporal context; the SoftIC loss, which manages class imbalance under mixup; and an end-to-end training pipeline that yields state-of-the-art results on SoccerNet V2 and other sports datasets, especially in tight-tolerance settings. The approach reduces reliance on large frozen backbones and remains computationally efficient, which supports precise indexing, summarization, and editing of sports videos in practical applications.

Abstract

Precise Event Spotting (PES) aims to identify events and their classes in long, untrimmed videos, particularly in sports. The main objective of PES is to detect each event at the exact moment it occurs. Existing methods mainly rely on features from a large pre-trained network, which may not be ideal for the task. Furthermore, these methods overlook the imbalanced distribution of event classes in the data, which hurts performance in challenging scenarios. This paper demonstrates that an appropriately designed network, trained end-to-end, can outperform state-of-the-art (SOTA) methods. In particular, we propose a network with a convolutional spatio-temporal feature extractor enhanced with our proposed Adaptive Spatio-Temporal Refinement Module (ASTRM) and a long-range temporal module. The ASTRM enhances the features with spatio-temporal information, while the long-range temporal module extracts global context from the data by modeling long-range dependencies. To address the class imbalance issue, we introduce the Soft Instance Contrastive (SoftIC) loss, which promotes feature compactness and class separation. Extensive experiments show that the proposed method is efficient and outperforms the SOTA methods, particularly in the more challenging settings.
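
The summary above notes that the SoftIC loss is meant to work under mixup. As background only, the following is a minimal sketch of generic mixup, which blends two samples and their one-hot labels into one sample with a soft label. It is not the paper's training code or its SoftIC loss; the function names `mixup` and `sample_lambda` and the default `alpha=0.2` are assumptions made for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1].

    Hypothetical helper for illustration: x_* are flat feature lists,
    y_* are one-hot label lists of equal length.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


# Mixing a sample of class 0 with a sample of class 1 yields a soft label.
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

Because the mixed label is no longer one-hot, any loss that groups features by class has to cope with fractional class membership, which is the setting the "soft" in SoftIC refers to in the summary.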

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Temporal dependency of the "Shots on target" event: the event is marked when the shot is taken, but whether the shot is on target can only be known by looking at future frames. Here, the shot is on target because the ball is saved by the goalkeeper.
  • Figure 2: Per-class score analysis on a few classes of the SoccerNet V2 dataset in the tight setting. The proposed method outperforms state-of-the-art (SOTA) methods, especially for classes with fewer samples.
  • Figure 3: Network design of the proposed framework. The framework is composed of a spatio-temporal feature extractor and a temporal block that captures long-range dependencies before the classifier. In each bottleneck block, we add ASTRM after the first convolution. ASTRM further enhances the features with local spatial, local temporal, and global temporal information. The network is trained with the SoftIC loss in addition to the classification loss to handle class imbalance.
  • Figure 4: Per-class score comparison in the $\delta=0$ setting, in terms of mAP, for the Tennis, FineGym, FS-Comp, and FS-Perf datasets. For FineGym, we report results for only 6 out of 32 classes, while results for all classes are reported for the other datasets.
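
To make the tolerance setting in these captions concrete, here is a minimal sketch of tolerance-based matching between predicted and ground-truth events. It assumes that a prediction counts as correct when it lies within $\delta$ frames of an unmatched ground-truth event of the same class, so $\delta=0$ demands the exact frame. The function `match_events`, the greedy highest-score-first matching, and the class labels in the example are illustrative assumptions, not the evaluation code used in the paper.

```python
def match_events(preds, truths, delta):
    """Mark each prediction as a hit or miss under a frame tolerance.

    preds:  list of (frame, label, score) tuples
    truths: list of (frame, label) tuples
    delta:  tolerance in frames; delta=0 requires the exact frame

    Returns one boolean per prediction, in descending-score order.
    Each ground-truth event can be matched at most once.
    """
    used = set()
    hits = []
    for frame, label, _score in sorted(preds, key=lambda p: -p[2]):
        best = None  # (distance, index) of the closest eligible ground truth
        for i, (t_frame, t_label) in enumerate(truths):
            if i in used or t_label != label:
                continue
            dist = abs(frame - t_frame)
            if dist <= delta and (best is None or dist < best[0]):
                best = (dist, i)
        if best is None:
            hits.append(False)
        else:
            used.add(best[1])
            hits.append(True)
    return hits


# Hypothetical events: the second prediction is one frame late.
truths = [(100, "serve"), (250, "bounce")]
preds = [(100, "serve", 0.9), (251, "bounce", 0.8), (400, "serve", 0.3)]
exact = match_events(preds, truths, delta=0)
loose = match_events(preds, truths, delta=1)
```

With `delta=0` the one-frame-late prediction is a miss, while `delta=1` accepts it. The hit/miss flags would then feed a per-class average-precision computation, which is omitted here.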