Multi-Focus Temporal Shifting for Precise Event Spotting in Sports Videos
Hao Xu, Xinyu Wei, Sam Wells, Sunil Aryal
TL;DR
This work tackles precise event spotting in sports videos by addressing the limited temporal receptive field and poor spatial selectivity of existing GSM-based methods. It introduces Multi-Focus Temporal Shifting (MFS), a lightweight extension that combines multi-scale temporal shifts with a Group Focus Module (GFM) to capture long- and short-term context while focusing on salient regions. The authors also present the Table Tennis Australia (TTA) dataset, a realistic PES benchmark with dense event annotations. Across five PES benchmarks and multiple backbones, MFS achieves state-of-the-art performance among lightweight methods at substantially lower FLOPs, with ablations confirming the benefits of jointly employing multi-scale shifts and grouped attention.
Abstract
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as the Gate-Shift Module (GSM) or Gate-Shift-Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose the Multi-Focus Temporal Shifting Module (MFS), which enhances GSM with multi-scale temporal shifts and a Group Focus Module, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MFS is a lightweight, plug-and-play module that integrates seamlessly with diverse 2D backbones. To further advance the field, we introduce the Table Tennis Australia dataset, the first PES benchmark for table tennis, containing over 4,800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MFS consistently improves performance with minimal overhead, achieving leading results among lightweight methods (+4.09 mAP, 45 GFLOPs).
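
To make the idea of multi-scale temporal shifting concrete, the following is a minimal NumPy sketch, not the authors' MFS implementation. It assumes per-frame features of shape (T, C), splits the channels into one group per stride, and shifts a fraction of each group forward and backward in time by that stride. The function names, the strides (1, 2, 4), and the one-quarter shift fraction are illustrative assumptions; the learned gating of GSM and the Group Focus Module are omitted.

```python
import numpy as np

def temporal_shift(x, stride, fold):
    """Shift channel slices of x (shape (T, C)) along time by `stride` frames.

    Assumes 1 <= stride < T. Frames with no source are left zero-padded.
    """
    out = np.zeros_like(x)
    # First `fold` channels receive features from `stride` frames earlier.
    out[stride:, :fold] = x[:-stride, :fold]
    # Next `fold` channels receive features from `stride` frames later.
    out[:-stride, fold:2 * fold] = x[stride:, fold:2 * fold]
    # Remaining channels are passed through unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out

def multi_scale_shift(x, strides=(1, 2, 4)):
    """Apply a different temporal stride to each channel group (hypothetical helper)."""
    _, num_channels = x.shape
    groups = np.array_split(np.arange(num_channels), len(strides))
    out = np.empty_like(x)
    for idx, stride in zip(groups, strides):
        fold = len(idx) // 4  # assumed: shift one quarter of the group each way
        out[:, idx] = temporal_shift(x[:, idx], stride, fold)
    return out

# Example: 8 frames, 12 channels, so each stride gets a group of 4 channels.
features = np.arange(1, 8 * 12 + 1, dtype=float).reshape(8, 12)
shifted = multi_scale_shift(features)
```

In this sketch, small strides mix information from adjacent frames while larger strides reach frames further away, which is one simple way to widen the temporal receptive field without adding parameters.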
