XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration

Denys Rozumnyi, Nadine Bertsch, Othman Sbai, Filippo Arcadu, Yuhua Chen, Artsiom Sanakoyeu, Manoj Kumar, Catherine Herold, Robin Kips

TL;DR

This work proposes to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full body motions in real time on XR devices. It demonstrates how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation.

Abstract

Tracking the full body motions of users in XR (AR/VR) devices is a fundamental challenge to bring a sense of authentic social presence. Due to the absence of dedicated leg sensors, currently available body tracking methods adopt a synthesis approach to generate plausible motions given a 3-point signal from the head and controller tracking. In order to enable mixed reality features, modern XR devices are capable of estimating depth information of the headset surroundings using available sensors combined with dedicated machine learning models. Such egocentric depth sensing cannot drive the body directly, as it is not registered and is incomplete due to limited field-of-view and body self-occlusions. For the first time, we propose to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full body motions in real time on XR devices. We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation. These modules are trained jointly in a self-supervised way, leveraging a combination of real unregistered point clouds and simulated data obtained from motion capture. We compare our approach against several state-of-the-art systems for XR body tracking and show that our method accurately tracks a diverse range of body motions. XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind.

Paper Structure

This paper contains 15 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: XR-MBT uses 3-points (head/wrists) and depth sensing from the XR device to learn real-time multi-modal body tracking. First, the synthesis stage generates a plausible body pose using the 3-Point signal. Then, the semantic point cloud network registers the body point cloud, and our self-supervised residual multi-modal pose estimation network predicts the refined body pose.
  • Figure 2: XR-MBT architecture. First, we use AGRoL (Du et al., 2023) to synthesize the initial pose (yellow). Second, the Semantic Point Cloud (SPC) encoder (green) processes the point cloud to generate point features. Third, the SPC decoder (blue) generates, for every point, the probability of mapping it to each body joint, which is used for self-supervised learning. Last, the Multi-modal Pose Estimation (MPE) network (orange) estimates the final body pose by combining all modalities. The MPE and SPC networks are jointly trained with a combination of Mocap and unlabeled depth data.
  • Figure 3: An example of real 3-Point and point cloud data accessible from XR devices and the corresponding output of our semantic point cloud (SPC) network. Even though the body point cloud is partial, because of self-occlusions and body parts lying outside the field of view, our SPC network learns to correctly assign points to the various body joints by leveraging the 3-Point information. This indicates that the SPC network learns a meaningful registration of the body point cloud to supervise the MPE network training.
  • Figure 4: Comparison of our XR-MBT method to other XR body tracking methods on real data. By leveraging multi-modal inputs in addition to the 3-Point data, our method is able to cover a large variety of lower body motions while preserving the plausibility of the pose for body parts outside the field of view of the depth sensor.
  • Figure 5: MPJPE for lower body computed on the Mocap test set with different action labels. We compare AGRoL (Du et al., 2023) to our method trained with the PC-loss or the SPC-loss.
  • ...and 3 more figures
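
The lower-body MPJPE reported in Figure 5 can be illustrated with a minimal sketch. It assumes the standard definition of MPJPE (mean per-joint position error: the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames). The function name `mpjpe`, the array layout `(frames, joints, 3)`, and the `LOWER_BODY_JOINTS` indices are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical joint indices for the lower body; the paper's actual
# joint set and skeleton ordering are not given in this summary.
LOWER_BODY_JOINTS = [1, 2, 4, 5, 7, 8, 10, 11]


def mpjpe(pred, target, joint_ids=None):
    """Mean per-joint position error.

    pred, target: arrays of shape (frames, joints, 3) in the same units.
    joint_ids: optional list of joint indices to restrict the metric to.
    Returns the mean Euclidean distance over the selected joints and frames.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    if joint_ids is not None:
        pred = pred[:, joint_ids]
        target = target[:, joint_ids]
    # Distance per joint and frame, then a single average over both axes.
    return float(np.linalg.norm(pred - target, axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(4, 22, 3))                   # 4 frames, 22 joints (assumed)
    est = gt + rng.normal(scale=0.01, size=gt.shape)   # small synthetic error
    print("full body MPJPE:", mpjpe(est, gt))
    print("lower body MPJPE:", mpjpe(est, gt, LOWER_BODY_JOINTS))
```

Restricting the metric to a joint subset matters here because 3-point inputs already constrain the upper body, so differences between methods are expected to show up mainly in the legs.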