HMD^2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device

Vladimir Guzov, Yifeng Jiang, Fangzhou Hong, Gerard Pons-Moll, Richard Newcombe, C. Karen Liu, Yuting Ye, Lingni Ma

TL;DR

HMD^2 is an online framework for generating full-body motion from a single head-mounted device. It fuses head motion, SLAM features of the environment, and egocentric image embeddings in a Transformer-based diffusion model, and uses autoregressive inpainting for low-latency online inference that stays temporally coherent. Evaluated on the large-scale, real-world Nymeria dataset, it reports state-of-the-art results in reconstruction, realism, and diversity, significantly outperforming baselines in both high- and low-latency regimes. The approach points to practical uses in telepresence, navigation, and activity-rich interaction with minimal hardware, and outlines avenues for further work on perception, privacy, and on-device deployment.

Abstract

This paper investigates the generation of realistic full-body human motion using a single head-mounted device with an outward-facing color camera and the ability to perform visual SLAM. To address the ambiguity of this setup, we present HMD^2, a novel system that balances motion reconstruction and generation. From a reconstruction standpoint, it aims to maximally utilize the camera streams to produce both analytical and learned features, including head motion, SLAM point cloud, and image embeddings. On the generative front, HMD^2 employs a multi-modal conditional motion diffusion model with a Transformer backbone to maintain temporal coherence of generated motions, and utilizes autoregressive inpainting to facilitate online motion inference with minimal latency (0.17 seconds). We show that our system provides an effective and robust solution that scales to a diverse dataset of over 200 hours of motion in complex indoor and outdoor environments.
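
The autoregressive inpainting mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it shows the general diffusion-inpainting idea, in which frames already committed by the previous window are re-imposed (noised to the current level) at every reverse step, so each new window continues the motion generated so far. The noise schedule, the window and overlap sizes, and the `dummy_denoiser` stand-in for the trained Transformer are all assumptions chosen so the script runs on its own.

```python
import numpy as np

def make_schedule(num_steps):
    """Assumed linear beta schedule in standard DDPM notation."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def dummy_denoiser(x_t, t, cond):
    """Hypothetical stand-in for the trained model: predicts clean motion.
    It simply echoes the conditioning so that the sketch is runnable."""
    return cond

def sample_window(cond, known, num_steps, rng):
    """Reverse diffusion for one window; the first len(known) frames are inpainted."""
    betas, alphas, abar = make_schedule(num_steps)
    x = rng.standard_normal(cond.shape)
    k = len(known)
    for t in range(num_steps - 1, -1, -1):
        if k > 0:
            # Re-impose the committed frames, noised to the current level t.
            noise = rng.standard_normal(known.shape)
            x[:k] = np.sqrt(abar[t]) * known + np.sqrt(1.0 - abar[t]) * noise
        x0_hat = dummy_denoiser(x, t, cond)
        if t == 0:
            x = x0_hat.copy()
        else:
            # Mean and variance of the DDPM posterior q(x_{t-1} | x_t, x0_hat).
            coef_x0 = np.sqrt(abar[t - 1]) * betas[t] / (1.0 - abar[t])
            coef_xt = np.sqrt(alphas[t]) * (1.0 - abar[t - 1]) / (1.0 - abar[t])
            var = betas[t] * (1.0 - abar[t - 1]) / (1.0 - abar[t])
            x = coef_x0 * x0_hat + coef_xt * x + np.sqrt(var) * rng.standard_normal(x.shape)
    if k > 0:
        x[:k] = known  # committed frames are kept exactly
    return x

def generate_online(cond_stream, window, overlap, num_steps=50, seed=0):
    """Slide a window over the conditioning stream, inpainting the overlap."""
    rng = np.random.default_rng(seed)
    stride = window - overlap
    known = np.zeros((0, cond_stream.shape[1]))
    chunks, start = [], 0
    while start + window <= len(cond_stream):
        x = sample_window(cond_stream[start:start + window], known, num_steps, rng)
        chunks.append(x[len(known):])   # emit only the newly generated frames
        known = x[window - overlap:]    # tail becomes the next window's prefix
        start += stride
    return np.concatenate(chunks)

# Toy conditioning stream: 20 frames, 2 features per frame.
stream = np.arange(40, dtype=float).reshape(20, 2)
motion = generate_online(stream, window=8, overlap=2)
print(motion.shape)
```

In this toy setup the conditioning and the motion share one feature dimension, which a real system would not assume. The sketch also does not model timing, so it says nothing about the 0.17-second latency reported above.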

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 2: Overview: HMD^2 generates realistic full-body motion that aligns with the signals from a single head-mounted device. Using the image streams from the egocentric camera, together with the head trajectory and feature cloud from the onboard SLAM system, we employ a diffusion-based framework to generate the wearer's full-body motion.
  • Figure 3: A typical input sequence from the egocentric camera, with only a few body parts of the wearer intermittently visible, which renders standard full-body reconstruction network backbones ineffective.
  • Figure 4: Autoregressive inpainting is performed at each reverse diffusion step to allow long-sequence generation in both high- and low-latency settings.
  • Figure 5: Qualitative comparison between HMD^2 (Ours) and baseline methods.
  • Figure 6: Our system can predict diverse outcomes from identical input (head pose marked as a sphere with coordinate system).
  • ...and 3 more figures