Estimating Body and Hand Motion in an Ego-sensed World

Brent Yi, Vickie Ye, Maya Zheng, Yunqi Li, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa

TL;DR

EgoAllo addresses ego-sensed human motion estimation by learning a diffusion prior, conditioned on head motion, that outputs body pose, height, and hand parameters in the allocentric frame of the scene. A key contribution is a conditioning representation that satisfies both spatial and temporal invariance criteria, which improves body estimation by up to 18% and yields body estimates grounded in the scene. The estimated bodies in turn improve hand estimation: guiding diffusion sampling at test time can reduce world-frame errors in single-frame hand estimates by 40%. The approach also includes global alignment, which places samples in the world frame, and sequence-length extrapolation to handle long sequences. The result is a geometry-aware framework that uses SLAM poses and egocentric video to recover metric-scale motion of the device wearer, with potential applications in AR, robotics, and assistive technology.

Abstract

We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture a device wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints can reduce world-frame errors in single-frame estimates by 40%. Project page: https://egoallo.github.io/

Paper Structure

This paper contains 25 sections, 12 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: EgoAllo. We present a system that estimates human body pose, height, and hand parameters from egocentric SLAM poses and images. Outputs capture the wearer's actions in the allocentric reference frame of the scene, which we visualize here with 3D reconstructions.
  • Figure 2: Overview of components in EgoAllo. We restrict the diffusion model to local body parameters (Section \ref{sec:diffusion_output_representation}). An invariant parameterization $g(\cdot)$ (Section \ref{sec:invariant_conditioning}) of SLAM poses is used to condition a diffusion model. These can be placed into the global coordinate frame via global alignment (Section \ref{sec:global_alignment}) to input poses. When available, egocentric video is used for hand detection via HaMeR \cite{pavlakos2023reconstructing}, which can be incorporated into samples via guidance (Section \ref{sec:guidance_losses}).
  • Figure 3: Locally canonicalized coordinate frames. We compute our invariant conditioning parameterization (Equation \ref{eq:cond_param}) using transformations computed from three coordinate frames: world, central pupil frame (CPF), and canonical. Following \cite{somasundaram2023projectaria}, the CPF has the $z$-axis forward. Following HuMoR \cite{rempe2021humor}, the world and canonical $z$-axes point up. Canonical frames are computed by projecting the CPF frame origin to the ground plane, then aligning the canonical $y$-axis to the CPF forward direction.
  • Figure 4: Egocentric human motion estimation for a running sequence. We show the ground-truth, an output from EgoAllo, and outputs from two baselines. The glasses CAD model is placed at the conditioning transformation $\textbf{T}_{\text{world},\text{cpf}}$.
  • Figure 5: Body estimation improves hand estimation. We show raw outputs from HaMeR \cite{pavlakos2023reconstructing} in blue and hand-body estimations from EgoAllo in purple. Top: improved scene interaction during touchscreen operation with EgoAllo-Mono. We know a priori that the fingers are contacting the screen in this sequence. Bottom: qualitative examples from EgoExo \cite{grauman2023egoexo4d} evaluation, showing the differences between monocular hands and EgoAllo-Wrist3D estimates.
  • ...and 7 more figures
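
The canonical-frame construction described in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustrative reading of the caption, not the authors' implementation: the function name `canonical_frame_from_cpf`, the 4x4 homogeneous-matrix convention, the `ground_z` parameter (ground plane assumed horizontal at a known height), and the choice of $x = y \times z$ to complete a right-handed frame are all assumptions made here.

```python
import numpy as np


def canonical_frame_from_cpf(T_world_cpf, ground_z=0.0, eps=1e-8):
    """Return a hypothetical T_world_canonical (4x4) from T_world_cpf (4x4).

    Assumptions (not from the paper's code): world z is up, the ground is
    the horizontal plane z = ground_z, and CPF +z is the forward direction.
    """
    T_world_cpf = np.asarray(T_world_cpf, dtype=float)
    rotation = T_world_cpf[:3, :3]
    origin_cpf = T_world_cpf[:3, 3]

    # CPF forward (+z) expressed in world coordinates.
    forward_world = rotation[:, 2]

    # Drop the vertical component so the canonical y-axis lies in the ground plane.
    y_axis = np.array([forward_world[0], forward_world[1], 0.0])
    norm = np.linalg.norm(y_axis)
    if norm < eps:
        raise ValueError("CPF forward is vertical; heading is undefined.")
    y_axis /= norm

    z_axis = np.array([0.0, 0.0, 1.0])  # canonical z points up, like world z
    x_axis = np.cross(y_axis, z_axis)   # completes a right-handed frame

    T_world_canonical = np.eye(4)
    T_world_canonical[:3, 0] = x_axis
    T_world_canonical[:3, 1] = y_axis
    T_world_canonical[:3, 2] = z_axis
    # Canonical origin: CPF origin projected onto the ground plane.
    T_world_canonical[:3, 3] = [origin_cpf[0], origin_cpf[1], ground_z]
    return T_world_canonical


# Example: device 1.6 m above the ground, looking along world +x.
T_world_cpf = np.array(
    [
        [0.0, 0.0, 1.0, 2.0],
        [-1.0, 0.0, 0.0, 3.0],
        [0.0, -1.0, 0.0, 1.6],
        [0.0, 0.0, 0.0, 1.0],
    ]
)
T_world_canonical = canonical_frame_from_cpf(T_world_cpf)

# CPF pose relative to the canonical frame.
T_canonical_cpf = np.linalg.inv(T_world_canonical) @ T_world_cpf
```

Under these assumptions, `T_canonical_cpf` does not change when the whole trajectory is shifted along the ground or rotated about the vertical axis, which is the kind of spatial invariance the conditioning parameterization is designed around. The exact quantities used in Equation \ref{eq:cond_param} are defined in the paper and are not reproduced here.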