AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting

Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

TL;DR

A simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing, which achieves state-of-the-art performance under strict evaluation metrics.

Abstract

Precise Event Spotting (PES) aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (e.g., $+3.96$ and $+2.26$ mAP$@0$ frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: https://github.com/arturxe2/AdaSpot.
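The efficiency argument in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that backbone cost grows roughly in proportion to the number of input pixels and that both branches have the same per-pixel cost; both are simplifying assumptions, not claims from the paper. The function name `relative_cost` and all resolutions are hypothetical and chosen only for illustration.

```python
def relative_cost(low_hw, roi_hw, high_hw):
    """Cost of (low-res frame + high-res RoI crop) relative to a full high-res frame.

    Assumption: cost is proportional to pixel count and identical per pixel
    in both branches. Real backbones only follow this approximately.
    """
    pixels = lambda hw: hw[0] * hw[1]
    return (pixels(low_hw) + pixels(roi_hw)) / pixels(high_hw)


# Hypothetical sizes: a 448x448 source, a 224x224 low-res view,
# and a 112x112 RoI cropped from the high-res frame.
ratio = relative_cost((224, 224), (112, 112), (448, 448))
overhead_vs_low_res = (112 * 112) / (224 * 224)
print(ratio, overhead_vs_low_res)
```

Under these illustrative numbers, the two-branch input costs well under half of uniform high-resolution processing, while the RoI crop adds a fraction of the low-resolution cost. The actual trade-off depends on the extractors and resolutions used, which this sketch does not model.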

Paper Structure (36 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Illustration of standard PES approaches: (a) high-resolution videos incur high computational cost, whereas (b) low-resolution videos reduce cost but lose fine-grained details crucial for precise temporal localization. In contrast, (c) AdaSpot captures global context from low-resolution videos and adaptively applies high-resolution processing to task-relevant regions, preserving fine-grained details efficiently.
  • Figure 2: Overview of our proposed method, AdaSpot. (a) The framework uses a low-resolution extractor to process low-resolution clips and generate global features $F_l$ and spatial maps $F_s$. A RoI selector leverages $F_s$ to identify the most relevant region in each frame. The resulting RoI sequence is then processed by a high-resolution extractor to capture fine-grained features, $F_h$. $F_l$ and $F_h$ are linearly projected, aggregated, and passed through a temporal modeler, before a prediction head produces per-frame classifications. (b) Details of the RoI selector: channel averaging generates saliency maps from $F_s$, spatio-temporal smoothing reduces noise, and adaptive-scale RoI selection adjusts the RoI size to the saliency spread.
  • Figure 3: Comparison of AdaSpot, a single-branch baseline, and redundancy-aware alternatives across multiple spatial resolutions on Tennis (left) and SN-BAS (right). Each point corresponds to a model configuration, with GFLOPs on the x-axis, mAP on the y-axis, and point size indicating the number of parameters. Models closer to the upper-left with smaller markers achieve better accuracy-efficiency trade-offs.
  • Figure 4: Qualitative visualization of the saliency maps and the corresponding RoIs selected by AdaSpot across all evaluated datasets: Tennis, FineDiving, FineGym, F3Set, and SN-BAS. In FineDiving and FineGym, events revolve around a main athlete, whereas in Tennis, F3Set, and SN-BAS, they revolve around the ball, which is marked with a star for clarity.
  • Figure 5: Illustration of the taxonomy of methods addressing spatio-temporal redundancy. We categorize approaches into architecture-based and input-based, and indicate whether each method handles spatial redundancy, temporal redundancy, or both.
  • ...and 3 more figures
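The three RoI-selector steps named in the Figure 2 caption (channel averaging, spatio-temporal smoothing, adaptive-scale RoI selection) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the box filter, the saliency-weighted centroid, the spread-to-size rule, and all function and parameter names (`saliency_from_features`, `smooth`, `select_roi`, `spread_scale`, `min_size`) are hypothetical choices made here.

```python
import numpy as np


def saliency_from_features(f_s):
    """Channel averaging: spatial maps F_s of shape (T, C, H, W) -> saliency (T, H, W)."""
    return f_s.mean(axis=1)


def smooth(sal, k=3):
    """Spatio-temporal smoothing with a k x k x k box filter (assumed filter choice)."""
    pad = k // 2
    padded = np.pad(sal, pad, mode="edge")  # replicate borders on all three axes
    t, h, w = sal.shape
    out = np.zeros((t, h, w), dtype=float)
    for dt in range(k):
        for dy in range(k):
            for dx in range(k):
                out += padded[dt:dt + t, dy:dy + h, dx:dx + w]
    return out / k ** 3


def select_roi(sal_frame, min_size=2.0, spread_scale=4.0):
    """Pick a square RoI (cy, cx, size) in feature-map cells for one frame.

    Assumed rule: center at the saliency-weighted centroid, side length
    proportional to the weighted standard deviation (the saliency spread).
    """
    h, w = sal_frame.shape
    max_size = float(min(h, w))
    weights = sal_frame - sal_frame.min()
    total = weights.sum()
    if total <= 0:  # flat saliency: fall back to the whole frame
        return (h - 1) / 2, (w - 1) / 2, max_size
    weights = weights / total
    ys, xs = np.mgrid[0:h, 0:w]
    cy = float((weights * ys).sum())
    cx = float((weights * xs).sum())
    spread = float(np.sqrt((weights * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum()))
    size = float(np.clip(spread_scale * spread, min_size, max_size))
    return cy, cx, size


# Usage on random stand-in features: 4 frames, 8 channels, 8x8 spatial map.
rng = np.random.default_rng(0)
f_s = rng.random((4, 8, 8, 8))
sal = smooth(saliency_from_features(f_s))
rois = [select_roi(frame) for frame in sal]
```

A concentrated saliency peak yields a small RoI and a diffuse map yields a larger one, which is the behaviour the caption describes as adjusting the RoI size to the saliency spread. Mapping the selected cell coordinates back to pixel coordinates in the high-resolution frame is left out of this sketch.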