Real-Time Simulated Avatar from Head-Mounted Sensors

Zhengyi Luo, Jinkun Cao, Rawal Khirodkar, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu

TL;DR

SimXR presents an end-to-end framework that directly maps head-mounted sensor data—images from XR cameras and headset pose—to control signals for a simulated humanoid, enabling realistic full-body avatar motion in real time. The method distills knowledge from a pretrained motion imitator, allowing efficient training that leverages synthetic XR data while remaining applicable to real-world AR/VR devices. It demonstrates strong pose accuracy and physical plausibility on both VR and AR headset configurations, with comprehensive ablations and comparisons against state-of-the-art baselines. The work provides large synthetic and real-world datasets to promote public research, and discusses failure cases and future directions to further improve temporal coherence and scene-aware motion realism.

Abstract

We present SimXR, a method for controlling a simulated avatar from information (headset pose and cameras) obtained from AR / VR headsets. Due to the challenging viewpoint of head-mounted cameras, the human body is often clipped out of view, making traditional image-based egocentric pose estimation challenging. On the other hand, headset poses provide valuable information about overall body motion, but lack fine-grained details about the hands and feet. To synergize headset poses with cameras, we control a humanoid to track headset movement while analyzing input images to decide body movement. When body parts are seen, the movements of hands and feet will be guided by the images; when unseen, the laws of physics guide the controller to generate plausible motion. We design an end-to-end method that does not rely on any intermediate representations and learns to directly map from images and headset poses to humanoid control signals. To train our method, we also propose a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2) and show promising results on real-world captures. To demonstrate the applicability of our framework, we also test it on an AR headset with a forward-facing camera.
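The distillation idea mentioned in the TL;DR (a controller that sees only headset-side inputs learns to reproduce the actions of a pretrained imitator) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear "teacher" and "student", the dimensions, and the names `teacher_policy`, `student_observation`, and `distill` are hypothetical and do not reproduce the paper's networks, simulator, or training loop. The sketch only shows the general pattern of regressing a student with restricted inputs onto teacher-labelled actions.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, OBS_DIM, ACT_DIM, N = 10, 23, 8, 256

# Hypothetical stand-in for a pretrained imitator: a fixed linear policy
# that acts on privileged full-body state the student never sees.
W_teacher = rng.normal(size=(STATE_DIM, ACT_DIM))

def teacher_policy(state):
    return state @ W_teacher

# Hypothetical stand-in for headset-side inputs (image features + headset pose):
# here simply a fixed linear projection of the underlying state.
P_sensor = rng.normal(size=(STATE_DIM, OBS_DIM))

def student_observation(state):
    return state @ P_sensor

def distill(obs, teacher_actions):
    """Fit a linear student to teacher-labelled actions by least squares."""
    W_student, *_ = np.linalg.lstsq(obs, teacher_actions, rcond=None)
    return W_student

states = rng.normal(size=(N, STATE_DIM))        # sampled training states
obs = student_observation(states)               # what the student sees
targets = teacher_policy(states)                # actions labelled by the teacher
W_student = distill(obs, targets)

# Mean squared difference between student and teacher actions on the training set.
mse = float(np.mean((obs @ W_student - targets) ** 2))
print(f"student/teacher action MSE: {mse:.2e}")
```

In this noise-free toy the student can match the teacher almost exactly because its observation is a full-rank projection of the state. With real head-mounted cameras, body parts are often out of view, which is where the abstract says physics has to fill in plausible motion.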

Paper Structure (45 sections, 1 equation, 9 figures, 9 tables, 1 algorithm)

Figures (9)

  • Figure 1: Avatar control using $\texttt{SimXR}$ on real-world AR/VR headsets. (Left): An indoor kitchen setting using an AR headset. $\texttt{SimXR}$ controls the humanoid using headset pose and visual input from two front-facing cameras. (Right): An office setting using a VR headset (Quest 2). Humanoid motion is driven by the headset pose, two side-facing and two up-facing cameras.
  • Figure 2: SimXR framework applied to two AR/VR devices. (Top): Quest 2 headset with 4 SLAM cameras, two facing upward and two downward. (Bottom): Aria glasses with two forward-facing SLAM cameras. Both devices provide 6DoF headset tracking in real time.
  • Figure 3: Our proposed $\texttt{SimXR}$ framework. From a large-scale human motion dataset, we first train a motion imitator (PHC Luo2023-ft) and render synthetic images. Then, we train our vision and headset pose-based controller by distilling from the pretrained imitator.
  • Figure 4: Qualitative results on synthetic and real-world AR/VR headset data. We visualize camera images, simulation, rendered mesh from simulation states, and third-person reference views. We show that our method can transfer to real-world data and handle diverse body poses including kicking, kneeling, etc. For AR headset results, the third-person view is provided by another subject wearing a headset.
  • Figure 5: Failure cases of our method: misplaced feet or hands.
  • ...and 4 more figures