WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

TL;DR

WHAM tackles the challenge of reconstructing accurate 3D human motion in global world coordinates from monocular video with a moving camera. It proposes a two-stage fusion pipeline that first lifts 2D keypoints into camera-space 3D poses and then uses a global trajectory module incorporating camera angular velocity, plus a trajectory refinement stage guided by foot contact to anchor motion on non-flat terrains. The method relies on AMASS-based pretraining to learn robust motion context and a feature integrator to fuse sparse keypoint information with dense visual cues, enabling online, real-time inference at high frame rates. Across multiple in-the-wild benchmarks, WHAM achieves state-of-the-art performance for both per-frame 3D pose accuracy and global trajectory estimation, while remaining computationally efficient and suitable for real-time applications. The work demonstrates the value of combining motion context, visual context, and explicit foot-ground contact for robust world-grounded motion capture from monocular video.

Abstract

The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/

Paper Structure (21 sections, 19 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: WHAM: World-grounded Humans with Accurate Motion. State-of-the-art methods like TRACE and SLAHMR fail to capture global 3D human trajectories accurately when given in-the-wild videos captured using a moving camera, producing implausible world-grounded motion (e.g., foot sliding). To address this, WHAM uses two novel strategies: (1) feature integration from 2D keypoints and pixels to reconstruct precise and pixel-aligned 3D human motion and (2) contact-aware trajectory recovery to place the human in a global coordinate system without foot sliding. Gray dots show the ground-truth global trajectory. See Supplemental Video.
  • Figure 2: An Overview of WHAM. WHAM takes the sequence of 2D keypoints estimated by a pretrained detector and encodes it into a motion feature. WHAM then updates the motion feature using another sequence of image features extracted from the image encoder through the feature integrator. From the updated motion feature, the Local Motion Decoder estimates 3D motion in the camera coordinate system and foot-ground contact probability. The Trajectory Decoder takes the motion feature and camera angular velocity to initially estimate the global root orientation and egocentric velocity, which are then updated through the Trajectory Refiner using the foot-ground contact. The final output of WHAM is pixel-aligned 3D human motion with the 3D trajectory in the global coordinates.
  • Figure 3: WHAM's Two-Stage Training Scheme. During pre-training, we generate synthetic 2D keypoint sequences from AMASS and train a motion encoder and decoder on the generated data (top). We then leverage video datasets with ground-truth SMPL parameters, for which there is much less data. We use the fixed-weight pre-trained image encoder and keypoint detector to extract image features and 2D keypoints. In this stage, we train the feature integration network while fine-tuning the motion encoder and motion/trajectory decoders (bottom).
  • Figure 4: Qualitative comparison with previous state-of-the-art methods for 3D human pose and shape estimation. See text.
  • Figure 5: Qualitative comparison with TRACE and SLAHMR on global human motion estimation with dynamic cameras.
  • ...and 3 more figures
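
The Figure 2 caption describes a Trajectory Decoder that outputs a global root orientation and an egocentric velocity per frame, with a Trajectory Refiner that then uses foot-ground contact. The snippet below is a minimal sketch of how such outputs could be rolled out into a world-space trajectory. It is not WHAM's implementation. The function names (`yaw_rotation`, `integrate_trajectory`, `damp_sliding`), the y-up axis convention, the contact threshold of 0.5, and the simple "subtract the contacting foot's residual velocity" rule are all assumptions made for illustration.

```python
import math


def yaw_rotation(theta):
    """Rotation matrix for a turn of `theta` radians about the vertical (y) axis.

    Assumption: y is up; the source does not state WHAM's axis convention.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def rotate(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def integrate_trajectory(orientations, ego_velocities, start=(0.0, 0.0, 0.0)):
    """Roll out world positions with p_t = p_{t-1} + R_t v_t.

    `orientations` holds one global root rotation per frame and
    `ego_velocities` holds the per-frame displacement expressed in the
    body's own frame. Returns the list of positions, including `start`.
    """
    position = list(start)
    trajectory = [tuple(position)]
    for R, v in zip(orientations, ego_velocities):
        step = rotate(R, v)  # express the body-frame step in world axes
        position = [p + d for p, d in zip(position, step)]
        trajectory.append(tuple(position))
    return trajectory


def damp_sliding(ego_velocity, foot_velocity, contact_prob, threshold=0.5):
    """Hypothetical contact correction, not the paper's Trajectory Refiner.

    If the foot is judged to be in contact, any velocity it still has is
    treated as sliding and removed from the root velocity. Both velocities
    are assumed to be expressed in the same (egocentric) frame.
    """
    if contact_prob < threshold:
        return list(ego_velocity)
    return [v - f for v, f in zip(ego_velocity, foot_velocity)]


# Usage: one step straight ahead, then one step after a quarter turn.
path = integrate_trajectory(
    [yaw_rotation(0.0), yaw_rotation(math.pi / 2)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
)
```

Because the step is rotated by the full root orientation before it is accumulated, a velocity with a vertical component (for example, while climbing stairs) moves the trajectory upward without any flat-ground assumption.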