RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

Linfeng Dong, Yuchen Yang, Hao Wu, Wei Wang, Yuenan Hou, Zhihang Zhong, Xiao Sun

TL;DR

RacketVision introduces a large-scale, multi-sport benchmark for unified ball and racket analysis across badminton, tennis, and table tennis, with three interconnected tasks: ball tracking, racket pose estimation, and ball trajectory prediction. The dataset provides pixel-level ball and racket annotations, produced by a two-stage annotation pipeline, and is accompanied by a three-task training workflow. A central finding is that naively concatenating racket pose features harms trajectory prediction, whereas cross-attention fusion integrates racket cues robustly and improves on strong unimodal baselines. The work also shows that multi-sport training fosters generalization, and it establishes a new public resource for dynamic object tracking and multimodal sports analytics.

Abstract

We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a cross-attention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page: https://github.com/OrcustD/RacketVision
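
To make the contrast between the two fusion strategies concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes ball and racket features are lists of equal-width vectors, uses identity projections in place of learned query/key/value weights, and adds a residual connection. The function names `concat_fusion` and `cross_attention_fusion` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def concat_fusion(ball_feats, racket_feats):
    """Naive fusion: append the same-frame racket vector to each ball vector."""
    return [b + r for b, r in zip(ball_feats, racket_feats)]

def cross_attention_fusion(ball_feats, racket_feats):
    """Each ball token (query) attends over all racket tokens (keys and values).

    Assumption: identity projections stand in for learned weights, so ball
    and racket vectors must share the same width d.
    """
    d = len(ball_feats[0])
    fused = []
    for q in ball_feats:
        # Scaled dot-product score between this ball token and every racket token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in racket_feats]
        weights = softmax(scores)
        # Attention-weighted average of the racket tokens.
        context = [sum(w * v[j] for w, v in zip(weights, racket_feats))
                   for j in range(d)]
        # Residual connection: the ball feature is kept and racket context is added.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

ball = [[1.0, 0.0], [0.0, 1.0]]
racket = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention_fusion(ball, racket)
```

In this sketch, concatenation widens every ball vector by a fixed racket vector regardless of relevance, whereas cross-attention lets each ball token weight the racket tokens by similarity before mixing them in.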

Paper Structure

This paper contains 17 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visual examples of annotated data samples in RacketVision across the three sports. Each panel displays annotations for the ball's position (red dot) and the racket's bounding box (orange rectangle). The insets of each panel provide a schematic of the five keypoints defined for each specific racket type, which are used for the racket pose estimation task.
  • Figure 2: The two-stage annotation pipeline for RacketVision. First, crowd-sourced annotators segment valid clips from raw videos where the ball is in motion. Second, on sparsely sampled frames from these clips, another group of annotators labels the ball's position as well as the racket's bounding box and keypoints using a specialized interface.
  • Figure 3: An overview of the task pipeline in RacketVision. Initially, the Ball Tracker and Racket Pose Estimator are trained using sparse ground-truth annotations. These models then process full video clips to generate dense trajectory data (soft labels), which serves as the training input for the final Ball Trajectory Predictor.
  • Figure 4: Visualization of the ball tracking results of MS-TrackNetV3 (with BM, #F=4) on table tennis. The red dots are sparse ground-truth ball position annotations, while the green dots are model predictions. The yellow line shows the combined path of ground truth and predictions, illustrating the complete ball trajectory within the clip.
  • Figure 5: Visualization of racket pose estimation results from the MS RTMPose model on a tennis clip.
  • ...and 1 more figure
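
The workflow in the Figure 3 and Figure 4 captions, where dense tracker predictions are combined with sparse ground-truth annotations and then fed to the trajectory predictor, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the helper names `merge_trajectory` and `make_windows` are hypothetical, and both the rule that ground truth overrides predictions on annotated frames and the sliding-window sampling are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def merge_trajectory(sparse_gt: Dict[int, Point],
                     predictions: Dict[int, Point]) -> List[Tuple[int, Point]]:
    """Combine dense predictions (soft labels) with sparse ground truth.

    Assumption: on frames with a ground-truth annotation, it replaces
    the model prediction.
    """
    merged = dict(predictions)
    merged.update(sparse_gt)
    return sorted(merged.items())  # ordered by frame index

def make_windows(track: List[Tuple[int, Point]], history: int, horizon: int):
    """Slice a dense track into (past, future) pairs for a trajectory predictor.

    Assumption: frame indices in `track` are contiguous; gaps are not handled.
    """
    points = [p for _, p in track]
    samples = []
    for start in range(len(points) - history - horizon + 1):
        past = points[start:start + history]
        future = points[start + history:start + history + horizon]
        samples.append((past, future))
    return samples

gt = {0: (0.0, 0.0), 4: (4.0, 4.0)}
preds = {0: (0.1, 0.1), 1: (1.0, 1.0), 2: (2.0, 2.0), 3: (3.0, 3.0), 4: (4.2, 4.1)}
track = merge_trajectory(gt, preds)
samples = make_windows(track, history=3, horizon=1)
```

Here `history` and `horizon` are illustrative parameters for the number of observed and forecast frames; the source does not state the values used.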