Multi-Focus Temporal Shifting for Precise Event Spotting in Sports Videos

Hao Xu, Xinyu Wei, Sam Wells, Sunil Aryal

TL;DR

This work tackles precise event spotting in sports videos by addressing the limited temporal receptive field and poor spatial selectivity of existing GSM-based methods. It introduces Multi-Focus Temporal Shifting (MFS), a lightweight extension that combines multi-scale temporal shifts with a Grouped Focus Module (GFM) to capture long- and short-term context while focusing on salient regions. The authors also present the Table Tennis Australia (TTA) dataset, a realistic PES benchmark with dense event annotations. Across five PES benchmarks and multiple backbones, MFS achieves state-of-the-art performance among lightweight methods at substantially lower FLOPs, with ablations confirming the benefits of jointly employing multi-scale shifts and grouped attention.

Abstract

Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as the Gate Shift Module (GSM) or the Gate Shift Fuse to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose the Multi-Focus Temporal Shifting Module (MFS), which enhances GSM with multi-scale temporal shifts and a Group Focus Module, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MFS is a lightweight, plug-and-play module that integrates seamlessly with diverse 2D backbones. To further advance the field, we introduce the Table Tennis Australia dataset, the first PES benchmark for table tennis, containing over 4,800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MFS consistently improves performance with minimal overhead, achieving leading results among lightweight methods (+4.09 mAP, 45 GFLOPs).

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of a table-tennis serve sequence. Identifying the actual serve event requires information from earlier frames showing the ball toss and preparation, as well as later frames showing the completion phase. This demonstrates that temporal cues beyond t±1 are necessary for accurate event spotting.
  • Figure 2: Heatmap visualizations across three event types: Row 1 — occluded ball-bounce; Row 2 — normal ball-bounce; Row 3 — jump-landing. MFS yields sharper, more meaningful focus regions, attending to players/table when the ball is occluded, to ball–racket interactions during normal bounce, and to the skater’s boot at landing.
  • Figure 3: Grad-CAM heatmaps on Tennis and TTA. Row 1 (Group 1) and Row 2 (Group 2) attend to different event-critical regions—e.g., the two players in Tennis, and the table versus the ball in TTA—showing that grouped focusing produces complementary spatial attention.
  • Figure 4: Multiscale Shift overview. A 3D CNN generates gate maps for different temporal shift ranges (e.g., $\Delta t = \pm1, \pm3$). Features are shifted bidirectionally at each scale, gated, and fused through learnable weights, enabling adaptive capture of both short- and long-term temporal dependencies.
  • Figure 5: Comparison of event density (average number of events) across varying temporal window sizes. TTA shows the highest event density across all ranges, reflecting its fine-grained and fast-paced nature.
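
The multi-scale shift described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation; it only mirrors the caption's steps (gate, shift bidirectionally at several scales, fuse with learnable weights) under several assumptions: the gate maps are passed in as arrays rather than produced by a 3D CNN, the channels are split in half so that one half is shifted forward in time and the other backward, frames shifted past the clip boundary are zero-filled, and the per-scale fusion weights are normalized with a softmax. The names `temporal_shift`, `multiscale_shift`, and `fuse_logits` are hypothetical.

```python
import numpy as np


def temporal_shift(x, dt):
    """Shift x along the time axis (axis 0) by dt frames, zero-filling the gap.

    dt > 0 moves past features forward in time: out[t] = x[t - dt].
    dt < 0 moves future features backward in time: out[t] = x[t + |dt|].
    """
    out = np.zeros_like(x)
    if dt > 0:
        out[dt:] = x[:-dt]
    elif dt < 0:
        out[:dt] = x[-dt:]
    else:
        out[:] = x
    return out


def multiscale_shift(x, gates, scales, fuse_logits):
    """Gated bidirectional shifting at several temporal scales (illustrative only).

    x:           features of shape (T, C, H, W)
    gates:       gate maps of shape (S, T, 1, H, W) with values in [0, 1];
                 assumed to be given (the caption says a 3D CNN produces them)
    scales:      S positive shift sizes, e.g. (1, 3) for dt = +-1, +-3
    fuse_logits: S unnormalized fusion weights (assumed softmax-normalized)
    """
    # Softmax over scales so the fused output stays on the same scale as x.
    weights = np.exp(fuse_logits - np.max(fuse_logits))
    weights = weights / weights.sum()

    half = x.shape[1] // 2
    out = np.zeros_like(x)
    for w, dt, gate in zip(weights, scales, gates):
        gated = gate * x          # part of the features selected for shifting
        residual = x - gated      # part that stays at its original time step
        shifted = np.concatenate(
            [
                temporal_shift(gated[:, :half], dt),    # first half: forward
                temporal_shift(gated[:, half:], -dt),   # second half: backward
            ],
            axis=1,
        )
        out += w * (shifted + residual)
    return out


# Usage: 8 frames, 4 channels, 2x2 spatial map, shifts of +-1 and +-3 frames.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 2, 2))
gate_maps = rng.uniform(size=(2, 8, 1, 2, 2))
fused = multiscale_shift(features, gate_maps, (1, 3), np.zeros(2))
```

With all gates at zero nothing is shifted and the output equals the input; with all gates at one, each scale contributes a pure bidirectional shift. A larger scale lets a frame see context several frames away in a single layer, which is the behaviour the serve example in Figure 1 motivates.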