WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black
TL;DR
WHAM reconstructs accurate 3D human motion in global world coordinates from monocular video captured by a moving camera. It first lifts 2D keypoint sequences to 3D poses in camera space, then estimates the body's global trajectory with a module that takes camera angular velocity as input. A trajectory refinement stage guided by foot-ground contact keeps the motion anchored on non-flat terrain. Pretraining on AMASS teaches the model robust motion context, and a feature integrator fuses the sparse keypoint information with dense visual cues, enabling online, real-time inference at high frame rates. Across multiple in-the-wild benchmarks, WHAM achieves state-of-the-art per-frame 3D pose accuracy and global trajectory estimation while remaining computationally efficient. The work demonstrates the value of combining motion context, visual context, and explicit foot-ground contact for robust world-grounded motion capture from monocular video.
Abstract
The estimation of 3D human motion from video has progressed rapidly, but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/
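To make the idea of a global trajectory concrete, the sketch below shows one generic way per-frame quantities can be accumulated into a world-frame path: a heading change and a body-frame velocity are integrated frame by frame. This is an illustrative simplification, not the authors' implementation. It is restricted to planar motion with a single yaw angle, the function name `rollout_trajectory` and its parameters are hypothetical, and the per-frame inputs are simply given here, whereas in WHAM such quantities are estimated by the model from human motion and camera angular velocity.

```python
import math


def rollout_trajectory(yaw_rates, local_velocities, dt=1.0):
    """Integrate per-frame yaw rate and body-frame velocity into a world path.

    Hypothetical helper for illustration only (planar, yaw-only motion).
    yaw_rates: heading change per unit time for each frame, in radians.
    local_velocities: (vx, vy) per frame, expressed in the body's own frame.
    Returns a list of (x, y) world positions, starting at the origin.
    """
    yaw = 0.0
    x, y = 0.0, 0.0
    path = [(x, y)]
    for yaw_rate, (vx, vy) in zip(yaw_rates, local_velocities):
        # Update the accumulated world heading first.
        yaw += yaw_rate * dt
        c, s = math.cos(yaw), math.sin(yaw)
        # Rotate the body-frame velocity into the world frame, then integrate.
        x += (c * vx - s * vy) * dt
        y += (s * vx + c * vy) * dt
        path.append((x, y))
    return path


# Walk forward one unit, turn 90 degrees to the left, walk forward again.
path = rollout_trajectory([0.0, math.pi / 2], [(1.0, 0.0), (1.0, 0.0)])
```

Because every position depends on all earlier frames, small per-frame errors accumulate into drift. That is one reason a correction signal such as foot-ground contact is useful in a rollout of this kind.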
