TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps

Arjun Raj, Lei Wang, Tom Gedeon

TL;DR

This paper introduces an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball’s location and improving tracking performance.

Abstract

Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball's location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
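
The frame differencing step mentioned in the abstract can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists with intensities in [0, 1]; the function name `abs_frame_differences` and the toy data are hypothetical and are not taken from the paper's code.

```python
def abs_frame_differences(frames):
    """Return |F[t+1] - F[t]| for each consecutive pair of grayscale frames.

    `frames` is a list of 2-D lists (rows of pixel intensities in [0, 1]).
    T input frames yield T - 1 difference maps.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


# Toy 1x4 "video": a bright ball (1.0) moves one pixel to the right per frame.
frames = [
    [[1.0, 0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0]],
]
diffs = abs_frame_differences(frames)
```

In this toy example each difference map is non-zero at both the pixel the ball left and the pixel it entered, which is the kind of motion region the attention maps are meant to emphasize.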

Paper Structure (6 sections, 4 equations, 17 figures, 2 tables)


Figures (17)

  • Figure 1: A visual comparison between (a) the original video frame, (b) the learned motion attention map, and (c) the motion-prompted frame. Tracking shuttlecocks is challenging due to their small size and tendency to blend into the background. To address this, we use a motion prompt layer (Chen et al., 2024) to generate motion attention maps that highlight the shuttlecock's location. We also create a motion-prompted frame by performing element-wise multiplication between the motion attention map and the original video frame, showing how motion features enhance visual representation. For better visualization, the shuttlecocks in these frames are zoomed in on the right.
  • Figure 2: We propose using learnable motion attention maps to enhance the tracking of high-speed, small objects in video frames. While demonstrated with TrackNetV2, our approach can be seamlessly integrated into any heatmap-based detection and tracking framework. Our method applies a motion prompt layer (Chen et al., 2024) to frame differencing maps to generate motion attention maps that highlight key motion regions, such as balls. The differencing maps use absolute values to capture both positive and negative pixel intensity changes, thereby reducing missed detections. The attention maps are then fused with high-level visual features before the heatmap output layer through element-wise multiplication, followed by concatenation. The tracking framework that features our motion-aware fusion is named TrackNetV4.
  • Figure 3: Comparison between (a) original frame differencing maps $\mathbf{D}_t$ and (b) absolute frame differencing maps $\mathbf{D}_t^{+}$, both using the same normalization function from Chen et al. (2024) with a slope of 16.24 and a shift of 0.28 for visualization. Our approach captures both positive and negative intensity changes, ensuring key motions for tracking and prediction are not missed, unlike Chen et al. (2024), which maps negative values to 0.
  • Figure 4: Comparison of feature maps and heatmaps with and without motion-aware fusion. Four visualization groups are shown, with the first row in each displaying the original frames. Motion-aware fusion improves visual representations (e.g., 2nd vs. 3rd row in (a)), resulting in clearer, more accurate ball predictions ((a) and (c)). Combined with high-level features, motion attention further refines ball localization (e.g., 4th vs. 5th row in (c)), reducing missed detections compared to the baseline ((b) and (d)). This demonstrates how motion awareness enhances tracking of fast-moving, small objects.
  • Figure 5: TrackNetV1.
  • ...and 12 more figures
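
The captions for Figures 2 and 3 describe normalizing difference maps into attention maps and then fusing them with visual features. The sketch below is one possible reading, not the authors' implementation. It assumes the normalization is a sigmoid of the form 1 / (1 + exp(-slope * (x - shift))), uses the slope and shift quoted in the Figure 3 caption, and assumes the first feature map passes through unchanged while each later one is multiplied by its attention map before the results are stacked. The names `attention` and `fuse` are hypothetical.

```python
import math

# Visualization values quoted in the Figure 3 caption.
SLOPE, SHIFT = 16.24, 0.28


def attention(diff_map, slope=SLOPE, shift=SHIFT):
    """Squash each difference value into (0, 1) with a shifted, scaled sigmoid.

    The exact functional form is an assumption made for this sketch.
    """
    return [[1.0 / (1.0 + math.exp(-slope * (d - shift))) for d in row]
            for row in diff_map]


def fuse(features, attention_maps):
    """Hypothetical fusion: keep the first feature map as-is, multiply each
    later feature map element-wise by its attention map, and stack the results."""
    fused = [features[0]]
    for feat, att in zip(features[1:], attention_maps):
        fused.append([[f * a for f, a in zip(f_row, a_row)]
                      for f_row, a_row in zip(feat, att)])
    return fused


# One row of signed differences: a pixel that darkened, one that brightened,
# and one that did not change.
signed = [[-0.8, 0.8, 0.0]]
absolute = [[abs(d) for d in row] for row in signed]

att_signed = attention(signed)   # the darkened pixel is suppressed towards 0
att_abs = attention(absolute)    # both changed pixels receive high attention

# Two toy feature maps (one per frame) and the single attention map between them.
features = [[[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]]]
fused = fuse(features, [att_abs])
```

Under these assumptions, the signed map gives almost no attention to the pixel whose intensity decreased, while the absolute map highlights both changed pixels and leaves the static pixel near zero, which matches the contrast the Figure 3 caption draws.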