Estimating Body and Hand Motion in an Ego-sensed World
Brent Yi, Vickie Ye, Maya Zheng, Yunqi Li, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa
TL;DR
EgoAllo estimates human motion from a head-mounted device by learning a diffusion prior, conditioned on head motion, over body pose, height, and hand parameters expressed in the allocentric frame of the scene. Its key contribution is a head motion conditioning representation designed to satisfy both spatial and temporal invariance criteria, which improves estimation by up to 18% and supports accurate, ground-referenced body estimates. Guided sampling at test time then uses the estimated body to constrain the hands, which can reduce world-frame errors in single-frame hand estimates by 40%. The approach also includes global alignment to place samples in the world and sequence-length extrapolation to handle long sequences, and it reports gains over baselines across multiple datasets. The result is a geometry-aware framework that uses only SLAM poses and egocentric video to recover the metric-scale motion of the wearer, who sits behind the camera's viewpoint, with potential applications in AR, robotics, and assistive technology.
Abstract
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture a device wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve hand estimation: the resulting kinematic and temporal constraints can reduce world-frame errors in single-frame estimates by 40%. Project page: https://egoallo.github.io/
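The abstract does not spell out the invariance criteria, so the following is only a minimal sketch of one plausible reading of spatial invariance: a conditioning signal built from relative head motion should not change when the same trajectory is expressed in a different world frame. The helper names (`make_pose`, `relative_motion`) and the toy trajectory are assumptions made for illustration; this is not EgoAllo's actual parameterization.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians, about z) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_motion(world_from_head):
    """Per-step relative transforms inv(T_{t-1}) @ T_t for an (N, 4, 4) trajectory."""
    prev, curr = world_from_head[:-1], world_from_head[1:]
    return np.linalg.inv(prev) @ curr

# Toy head trajectory (hypothetical): walking forward while turning, head at 1.6 m.
traj = np.stack([make_pose(0.1 * t, [0.2 * t, 0.0, 1.6]) for t in range(5)])

# The same motion expressed in a different world frame (arbitrary global transform).
G = make_pose(1.3, [5.0, -2.0, 0.0])
moved = G @ traj  # (4, 4) broadcast against (5, 4, 4)

# Absolute poses depend on the choice of world frame; relative motion does not.
absolute_changed = not np.allclose(traj, moved)
relative_unchanged = np.allclose(relative_motion(traj), relative_motion(moved))
print(absolute_changed, relative_unchanged)
```

Note that purely relative motion also discards the head's absolute height above the floor, so a representation meant to support ground-referenced estimates would presumably need to retain some floor-relative information as well; how EgoAllo does this is not stated in the abstract.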
