Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin'ichi Satoh, Rainer Lienhart

TL;DR

This work reconstructs accurate 3D table tennis ball trajectories and spin from monocular video, a setting in which real-world detection noise and the lack of 3D ground truth hinder prior approaches. It introduces a two-stage pipeline: a front-end for 2D ball and table keypoint detection trained on the new TTHQ dataset, and a back-end 2D-to-3D uplifting model trained solely on synthetic data and made robust to missing detections and varying frame rates using time-aware RoPE embeddings. The key contributions are high-resolution detectors based on Segformer++ for both the ball and the table geometry, a transformer-based back-end that generalizes to real data, and the TTHQ dataset with 2D annotations and spin labels. Together they enable an end-to-end tool for 3D trajectory and spin analysis in real-world broadcast footage. The results show strong 2D detection performance, robust 3D uplifting under real-world imperfections, and high spin classification accuracy, making the pipeline practical for sports analytics and extensible to other 3D reconstruction tasks with limited ground truth.
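The idea behind time-aware RoPE embeddings can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the rotation angle of a standard rotary embedding is driven by the frame timestamp $t_n$ rather than by the integer token index, so that a dropped frame or a different frame rate changes the angle consistently. The function names, the `base` constant, and the `time_scale` factor are hypothetical choices made only for this illustration.

```python
import math

def rope_rotate(vec, t, base=10000.0, time_scale=50.0):
    """Rotate consecutive pairs of `vec` by angles proportional to time `t`.

    Assumption: `t` is a timestamp in seconds; `time_scale` (hypothetical)
    converts it into a nominal position before applying the usual RoPE
    frequencies base ** (-2j / d) for pair index j.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)           # i = 2j, so this is base ** (-2j / d)
        angle = time_scale * t * freq     # angle depends on time, not on index
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, k, t_q, t_k):
    """Unnormalized attention logit between a query at t_q and a key at t_k."""
    q_rot = rope_rotate(q, t_q)
    k_rot = rope_rotate(k, t_k)
    return sum(a * b for a, b in zip(q_rot, k_rot))

# Toy query/key vectors (illustrative values only).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same time gap (0.04 s) at different absolute times gives the same logit.
score_early = attention_score(q, k, 0.10, 0.06)
score_late = attention_score(q, k, 0.54, 0.50)
```

Under this assumption the attention logit depends only on the time difference between two tokens, which is why a sequence with missing frames or an unusual frame rate would still be encoded consistently.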

Abstract

Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.

Paper Structure

This paper contains 19 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Qualitative example prediction of the full pipeline for a serve trajectory. The green dots represent the front-end detections for 2D ball positions and table keypoints. The magenta dots represent the predicted 3D ball trajectory from the back-end.
  • Figure 2: Overview of the proposed pipeline. In the front-end stage, we detect the 2D ball position and localize the 13 table keypoints in each frame $n$ at time $t_n$. After robust filtering, we obtain a clean 2D ball trajectory $\{\vec{r}_\text{2D}(t_n)\}_{n=0}^{N-1}$ with $N$ being the number of frames in the trajectory. As we assume a static camera, we obtain a single, time-independent set of table keypoints $\{\vec{r}_{\text{table,}k}\}_{k=1}^{13}$ after filtering. In the back-end stage, the coordinates are embedded into a location token $l_n$ for each timestep $t_n$, a learnable spin token $s$ is prepended, and the sequence is then processed by the uplifting network to predict the 3D trajectory $\{\vec{r}_\text{3D}(t_n)\}_{n=0}^{N-1}$ and the initial spin $\vec{\omega}(t_0)$. The blue color represents modules with learnable parameters.
  • Figure 3: Definition of the 13 Table Keypoints and illustration of the world coordinate system axes.
  • Figure 4: Embedding Module. The detected ball position $\vec{r}_\text{2D}(t_n)$ at time $t_n$ and all visible table keypoints are projected into a higher dimensional space by a 2-layer MLP and then processed by a 4-block transformer. Finally, only the token corresponding to the ball position is kept as location token $l_n$.
  • Figure 5: Uplifting Network. The input consists of the learnable spin token $s$ and the sequence of location tokens $\{l_0, \dots, l_{N-1}\}$ obtained from the embedding module. The first stage, consisting of $L-4$ transformer blocks, computes the 3D trajectory. The second stage, consisting of 4 transformer blocks, computes the initial spin. To obtain the final three-dimensional output vectors, we apply a small 3-layer MLP as a head for both the trajectory and the spin.
  • ...and 3 more figures
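
The hand-off between front-end and back-end described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's filtering procedure: the caption only says "robust filtering", so the per-keypoint median used here, the function name `build_backend_input`, and the input layout (one optional ball detection and one dictionary of visible table keypoints per frame) are all hypothetical.

```python
from statistics import median

def build_backend_input(ball_detections, table_detections, fps):
    """Turn per-frame front-end outputs into back-end inputs.

    ball_detections:  list with one (x, y) tuple or None per frame.
    table_detections: list with one {keypoint_id: (x, y)} dict per frame,
                      containing only the keypoints detected in that frame.
    fps:              frame rate of the video, used to compute timestamps.
    """
    # Keep only frames with a ball detection, but remember their real
    # timestamps so that gaps remain visible to a time-aware model.
    times, ball = [], []
    for n, detection in enumerate(ball_detections):
        if detection is None:
            continue
        times.append(n / fps)
        ball.append(detection)

    # Static camera assumption: collapse each keypoint to one position.
    # The median is an assumed stand-in for the unspecified robust filter.
    keypoint_ids = set()
    for frame in table_detections:
        keypoint_ids.update(frame.keys())
    table = {}
    for k in sorted(keypoint_ids):
        xs = [frame[k][0] for frame in table_detections if k in frame]
        ys = [frame[k][1] for frame in table_detections if k in frame]
        table[k] = (median(xs), median(ys))
    return times, ball, table

# Toy example: the ball is missed in frame 1, keypoint 1 has one outlier.
times, ball, table = build_backend_input(
    ball_detections=[(10.0, 20.0), None, (14.0, 18.0)],
    table_detections=[
        {1: (100.0, 200.0)},
        {1: (102.0, 200.0)},
        {1: (500.0, 200.0), 2: (0.0, 0.0)},
    ],
    fps=50,
)
```

In this sketch the missed frame is simply absent from the ball sequence while its neighbours keep their true timestamps, and the outlier in keypoint 1 does not shift the aggregated table position.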