Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance
Sanchayan Santra, Vishal Chudasama, Pankaj Wasnik, Vineeth N. Balasubramanian
TL;DR
This work tackles Precise Event Spotting (PES) in long, untrimmed sports videos, addressing two core challenges: long-range temporal dependencies and severe class imbalance. It introduces an end-to-end network that combines a CNN-based spatio-temporal feature extractor (RegNetY) with an Adaptive Spatio-Temporal Refinement Module (ASTRM) and a Bi-GRU temporal block, trained with the Soft Instance Contrastive (SoftIC) loss and optimized with Adaptive Sharpness-Aware Minimization (ASAM). The key contributions are the ASTRM, which enriches features with local and global temporal context; the SoftIC loss, which handles class imbalance under mixup; and an end-to-end training pipeline that yields state-of-the-art results on SoccerNet V2 and other sports datasets, especially in tight-tolerance settings. The approach reduces reliance on large frozen backbones and remains computationally efficient, supporting precise indexing, summarization, and editing of sports videos in practical applications.
Abstract
Precise Event Spotting (PES) aims to identify events and their classes in long, untrimmed videos, particularly in sports. The main objective of PES is to detect each event at the exact moment it occurs. Existing methods mainly rely on features from a large pre-trained network, which may not be ideal for the task. Furthermore, these methods overlook the imbalanced distribution of event classes in the data, which hurts performance in challenging scenarios. This paper demonstrates that an appropriately designed network, trained end-to-end, can outperform state-of-the-art (SOTA) methods. In particular, we propose a network with a convolutional spatio-temporal feature extractor enhanced with our proposed Adaptive Spatio-Temporal Refinement Module (ASTRM) and a long-range temporal module. The ASTRM enhances the features with spatio-temporal information, while the long-range temporal module extracts global context from the data by modeling long-range dependencies. To address the class imbalance issue, we introduce the Soft Instance Contrastive (SoftIC) loss, which promotes feature compactness and class separation. Extensive experiments show that the proposed method is efficient and outperforms the SOTA methods, particularly in more challenging settings.
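To make the notion of spotting an event "at the exact moment it occurs" concrete, the following minimal Python sketch counts a predicted event as correct only if it falls within a given number of frames of a ground-truth event of the same class. It is an illustration under stated assumptions, not the evaluation protocol used in the paper: the function name `match_events`, the greedy confidence-ordered matching, and the frame-based tolerance are all hypothetical choices made here for clarity.

```python
from typing import List, Tuple

Event = Tuple[int, str]              # (frame index, class label)
Prediction = Tuple[int, str, float]  # (frame index, class label, confidence)


def match_events(
    predictions: List[Prediction],
    ground_truth: List[Event],
    tolerance: int,
) -> Tuple[int, int, int]:
    """Greedily match predictions to ground-truth events (hypothetical helper).

    A prediction is a true positive if an unmatched ground-truth event with
    the same label lies within `tolerance` frames of it.
    Returns (true_positives, false_positives, false_negatives).
    """
    matched = [False] * len(ground_truth)
    true_pos = 0
    false_pos = 0
    # Visit the most confident predictions first.
    for frame, label, _score in sorted(predictions, key=lambda p: -p[2]):
        best_idx = None
        best_dist = None
        for idx, (gt_frame, gt_label) in enumerate(ground_truth):
            if matched[idx] or gt_label != label:
                continue
            dist = abs(frame - gt_frame)
            if dist <= tolerance and (best_dist is None or dist < best_dist):
                best_idx, best_dist = idx, dist
        if best_idx is None:
            false_pos += 1
        else:
            # Each ground-truth event can be claimed by one prediction only.
            matched[best_idx] = True
            true_pos += 1
    false_neg = matched.count(False)
    return true_pos, false_pos, false_neg


if __name__ == "__main__":
    gt = [(100, "goal"), (250, "foul")]
    preds = [(101, "goal", 0.9), (255, "foul", 0.8), (400, "goal", 0.3)]
    for tol in (1, 5):
        print(tol, match_events(preds, gt, tolerance=tol))
```

With the made-up events above, the "foul" prediction is five frames off, so it is rejected at a tolerance of one frame and accepted at a tolerance of five. This is why tight-tolerance settings are the harder ones: small temporal errors that are harmless under a loose tolerance become misses.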
