From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models

German Barquero, Nadine Bertsch, Manojkumar Marramreddy, Carlos Chacón, Filippo Arcadu, Ferran Rigual, Nicky Sijia He, Cristina Palmero, Sergio Escalera, Yuting Ye, Robin Kips

TL;DR

This work tackles real-time full-body motion generation in XR from temporally and spatially sparse inputs, notably hand tracking that may drop out. It introduces RPM, an online autoregressive framework that progressively refines a short horizon of future poses, coupled with a Prediction Consistency Anchor Function (PCAF) to control the balance between accuracy and smoothness during tracking-to-synthesis transitions. A free-running training strategy trains the system to cope with its own errors, enhancing stability under input gaps. The authors also present GORP, a real VR dataset with real tracking signals and ground-truth motion, to benchmark performance in realistic conditions. Results show RPM achieves smoother, more plausible transitions with competitive accuracy, and the dataset reveals a gap between synthetic benchmarks and real-world signals, underscoring the practical impact of robust, real-time motion generation for XR avatars.

Abstract

In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for extended periods of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high-quality body motion ground truth. GORP provides >14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially and temporally sparse). We benchmark RPM against the state of the art on both synthetic data and GORP to highlight how we can bridge the gap for real-world applications with a realistic dataset and by handling unreliable input signals. Our code, pretrained models, and GORP dataset are available on the project webpage.
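The online loop described above, with a tracking mode when inputs are present, a synthesis mode when they are missing, and a short horizon of future poses that is refined at every step, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `rolling_generate` and `toy_step`, the 1-D "pose", and the blending rule are hypothetical stand-ins and are not the authors' network or their PCAF module.

```python
from typing import Callable, List, Optional

# Hypothetical 1-D "pose" to keep the example readable; a real pose would be
# a full-body vector.
Pose = float
StepFn = Callable[[List[Pose], List[Optional[Pose]], List[Pose]], List[Pose]]


def rolling_generate(step_fn: StepFn, inputs: List[Optional[Pose]],
                     window: int, init_pose: Pose) -> List[Pose]:
    """Online loop that keeps a rolling horizon of `window` predicted poses.

    At each frame the step function sees the past generated motion, the
    tracking history (None marks a lost signal), and the previous horizon,
    and returns a refined horizon. The nearest pose of that horizon is emitted.
    """
    past_motion = [init_pose]
    tracking_history: List[Optional[Pose]] = []
    prediction = [init_pose] * window
    for signal in inputs:
        tracking_history.append(signal)
        prediction = step_fn(past_motion, tracking_history, prediction)
        if len(prediction) != window:
            raise ValueError("step_fn must return exactly `window` poses")
        past_motion.append(prediction[0])  # autoregressive feedback
    return past_motion[1:]


def toy_step(past_motion: List[Pose], tracking_history: List[Optional[Pose]],
             prev_prediction: List[Pose]) -> List[Pose]:
    """Toy stand-in for a learned model (an assumption, not the paper's)."""
    window = len(prev_prediction)
    signal = tracking_history[-1]
    # Advance the previous horizon by one frame, repeating its farthest pose.
    shifted = prev_prediction[1:] + prev_prediction[-1:]
    if signal is None:
        return shifted  # synthesis mode: keep following the earlier plan
    # Tracking mode: pull poses toward the signal, far poses more than near
    # ones, so the emitted motion approaches the input gradually.
    return [p + (signal - p) * (k + 1) / window for k, p in enumerate(shifted)]


demo = rolling_generate(toy_step, [1.0, 1.0, None, None, 1.0],
                        window=4, init_pose=0.0)
print(demo)  # [0.25, 0.625, 0.875, 1.0, 1.0]
```

In this toy run the signal is lost for two frames, yet the emitted values keep progressing along the previously predicted horizon instead of freezing or jumping, which is the qualitative behavior the rolling horizon is meant to enable.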

Paper Structure

This paper contains 21 sections, 7 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: We introduce Rolling Prediction Model, an approach that generates smooth and realistic full-body human motion from two of the most common XR sensing signals: hand controllers, in which the tracking signal is always available (left), and hand tracking, in which the tracking signal is noisy and might be lost for long periods of time (right). Tracking input trajectories are shown as magenta lines.
  • Figure 2: Our RPM is conditioned on the past generated motion and the past and present tracking inputs. It outputs the predicted motion, which is fed to the PCAF module in the next iteration.
  • Figure 3: While in free-running, the tracking signal and the generated motion might misalign. When applying a distance-based loss, the gradient pushes to correct the predicted motion and immediately match the tracking signal. This makes the model generate jittery motion and abrupt transitions after tracking input losses (top row). PCAF forces the magnitude of the correction to be within the bounds of the PCAF uncertainty (bottom row).
  • Figure 4: Flexible reactiveness. RPM achieves the best accuracy when the prediction length is around 8 frames (or 133ms), and the smoothest results around 15 frames (or 250ms). By leveraging longer prediction windows, we can trade off smoothness for tracking reactiveness (i.e., lower jitter and peak jerk, higher hands PE).
  • Figure 5: Trajectory prediction. RPM decomposes the generation of motion into a progressive refinement of the predicted $W$ next poses, shown above as magenta dots, connected by lines. On top, we observe how RPM can predict fast dynamic motion and generate expressive and realistic motion, even during tracking signal losses. Below, we show how RPM generates a smooth transition when recovering from a hand-tracking loss (left hand).
  • ...and 10 more figures
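
The frame and millisecond figures quoted in the Figure 4 caption (8 frames ≈ 133 ms, 15 frames = 250 ms) are consistent with a 60 frames-per-second signal. That rate is inferred from the caption's numbers and is not stated in the text above, so it is treated as an assumption in this short check; the helper name `frames_to_ms` is likewise hypothetical.

```python
def frames_to_ms(frames: int, fps: float = 60.0) -> float:
    """Convert a prediction length in frames to milliseconds.

    The default of 60 fps is an assumption inferred from the caption.
    """
    return 1000.0 * frames / fps


best_accuracy_ms = frames_to_ms(8)     # about 133.3 ms
smoothest_ms = frames_to_ms(15)        # 250.0 ms
print(round(best_accuracy_ms), round(smoothest_ms))  # 133 250
```

Under that assumption, the longer window associated with the smoothest results looks roughly a quarter of a second ahead, which helps explain why it trades tracking reactiveness for lower jitter.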