Towards Versatile Humanoid Table Tennis: Unified Reinforcement Learning with Prediction Augmentation

Muqun Hu, Wenxi Chen, Wenjing Li, Falak Mandali, Zijian He, Renhong Zhang, Praveen Krisna, Katherine Christian, Leo Benaharon, Dizhi Ma, Karthik Ramani, Yan Gu

TL;DR

The paper tackles the challenge of versatile humanoid table tennis by proposing a unified end-to-end reinforcement learning framework that maps ball-position observations and proprioception to whole-body motions for both striking and locomotion. A lightweight ball trajectory predictor augments the actor's observations, and physics-based dense rewards guide learning, enabling proactive footwork and accurate returns. Ablations show that the predictor and the prediction-based rewards are both critical for effective end-to-end learning; the policy achieves strong simulation results and transfers zero-shot to a 23-DoF Booster T1 humanoid. The work demonstrates a practical path toward versatile table tennis play, combining Sim2Real transfer with a compact, unified control policy and leaving room for future improvements in dexterity and curriculum learning.

Abstract

Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing -- capabilities that remain difficult for unified controllers. We propose a reinforcement learning framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy's observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate $\geq$ 96% and success rate $\geq$ 92%) in simulations. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward-backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT.
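To make the idea of a physics-based predictor concrete, here is a minimal sketch of how future ball states might be estimated from a few recent ball positions. It is not the authors' implementation: it assumes a drag-free, spin-free ballistic model with no table bounce, a z-up coordinate frame, and equally spaced samples, and the names `fit_ballistic`, `predict_position`, and `predict_at_plane` are hypothetical.

```python
G = 9.81  # m/s^2; gravity is assumed to act along -z


def fit_ballistic(positions, dt):
    """Least-squares fit of p(t) = p0 + v0*t + 0.5*g*t^2 to (x, y, z) samples.

    `positions` are equally spaced in time by `dt`; t = 0 is the first sample.
    Assumption: no drag, spin, or bounce (a simulator would model these).
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two ball positions")
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    t_var = sum((t - t_mean) ** 2 for t in times)
    p0, v0 = [], []
    for axis in range(3):
        # Add the known gravity drop back on z so each axis is linear in t.
        vals = [
            p[axis] + (0.5 * G * t * t if axis == 2 else 0.0)
            for p, t in zip(positions, times)
        ]
        v_mean = sum(vals) / n
        slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, vals)) / t_var
        v0.append(slope)
        p0.append(v_mean - slope * t_mean)
    return p0, v0


def predict_position(p0, v0, t):
    """Ball position t seconds after the first sample."""
    return [
        p0[0] + v0[0] * t,
        p0[1] + v0[1] * t,
        p0[2] + v0[2] * t - 0.5 * G * t * t,
    ]


def predict_at_plane(p0, v0, x_plane):
    """Time and position at which the ball crosses the plane x = x_plane.

    Returns None if the ball is not moving toward that plane.
    """
    if abs(v0[0]) < 1e-9:
        return None
    t_cross = (x_plane - p0[0]) / v0[0]
    if t_cross < 0:
        return None
    return t_cross, predict_position(p0, v0, t_cross)
```

A predicted crossing point of this kind could serve as a hitting target; the paper's learned predictor would instead be trained to approximate such targets from noisy observations.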

Paper Structure

This paper contains 36 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The Booster T1 humanoid successfully returns a high-speed ball (6 m/s) from a serving machine. The learned end-to-end whole-body control policy achieves a rapid 0.5-second interception and return, demonstrating coordinated hand-leg movements. Supplementary video: https://youtu.be/vzXuCIXpLaE
  • Figure 2: Overview of the training pipeline. A learnable predictor anticipates the future desired hitting position of an incoming ball, $\tilde{\mathbf{p}}_{ball}$. Physics-based simulation provides a dynamics-model-based prediction, $\hat{\mathbf{p}}_{ball}$, which serves both as ground truth for training the predictor and as the basis for constructing dense, continuous rewards (illustrated in Fig. 3).
  • Figure 3: Prediction-based reward design for post-stroke motions. Using the ball physics in simulation, we anticipate the ball’s trajectory (green dashed line). From this, we define a hit-guidance reward that encourages the robot to move proactively to intercept the ball (as visualized by the solid green and blue arrows), and a return-guidance reward that scores each strike based on the predicted landing point and the ball’s height at the net. Together, these continuous rewards enable the robot to refine its strikes and achieve successful returns.
  • Figure 4: Time-lapse sequences of two consecutive rallies by T1. In the top row, the arm is predominantly used to hit the first ball. In the bottom row, by contrast, the entire trunk rotates to hit the second ball.
  • Figure 5: Successful strike positions under different serve ranges. Each dot marks the paddle contact point of a successful return. Short and long serves are distinguished by their initial velocity along the $x$-axis of the TT table, consistent with the specifications in the simulation performance table.
  • ...and 2 more figures
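
The two rewards described in the Figure 3 caption can be illustrated with a small sketch. The paper's actual reward terms and weights are not given here, so everything below is an assumption: the exponential shaping, the `scale` parameters, the drag-free ballistic model, the frame (net at `net_x`, z up), and the function names `hit_guidance` and `return_guidance`. Only the table and net heights are standard ITTF dimensions.

```python
import math

G = 9.81             # m/s^2; gravity assumed along -z
TABLE_HEIGHT = 0.76  # m, standard table surface height
NET_HEIGHT = 0.1525  # m, standard net height above the table surface


def hit_guidance(paddle_pos, predicted_hit_pos, scale=0.5):
    """Dense reward in (0, 1]: larger as the paddle nears the predicted hit point."""
    return math.exp(-math.dist(paddle_pos, predicted_hit_pos) / scale)


def return_guidance(ball_pos, ball_vel, net_x, target_xy, scale=0.5):
    """Score a strike from the ball state just after paddle contact.

    Combines (a) the predicted height at the net plane and (b) the distance
    between the predicted landing point and a target on the opponent's side.
    Hypothetical shaping; drag, spin, and net collisions are ignored.
    """
    x, y, z = ball_pos
    vx, vy, vz = ball_vel
    if (net_x - x) * vx <= 0:
        return 0.0  # ball is not travelling toward the net
    t_net = (net_x - x) / vx
    z_net = z + vz * t_net - 0.5 * G * t_net ** 2
    clearance = z_net - (TABLE_HEIGHT + NET_HEIGHT)
    # Equals 1 when the ball clears the net, decays smoothly when it does not.
    net_term = min(1.0, math.exp(clearance / 0.1))
    # Landing time: later root of z + vz*t - 0.5*G*t^2 = TABLE_HEIGHT.
    disc = vz * vz + 2.0 * G * (z - TABLE_HEIGHT)
    if disc < 0:
        return 0.0  # ball never reaches table height
    t_land = (vz + math.sqrt(disc)) / G
    landing_xy = (x + vx * t_land, y + vy * t_land)
    landing_term = math.exp(-math.dist(landing_xy, target_xy) / scale)
    return net_term * landing_term
```

Because both terms vary continuously with the strike, even a failed return produces a graded signal, which is the property the caption attributes to these rewards.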