RoHM: Robust Human Motion Reconstruction via Diffusion

Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo

TL;DR

RoHM tackles robust 3D human motion reconstruction from monocular RGB(-D) videos under noise and occlusions. It introduces two diffusion-based models, TrajNet for the global root trajectory and PoseNet for local body motion, coupled via a TrajControl conditioning module and an iterative inference scheme, with score-guided sampling to enforce physical plausibility and image consistency. Trained with a curriculum on AMASS and evaluated on AMASS, PROX, and EgoBody, RoHM achieves higher accuracy and realism than optimization-based baselines while running substantially faster at test time. The approach supports denoising as well as spatial and temporal infilling, and is relevant to AR/VR, robotics, and human-scene interaction tasks.

Abstract

We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html.
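To make the split into global trajectory and local motion concrete, the following minimal Python sketch separates per-frame joint positions into a root trajectory and root-relative joint offsets, then recombines them. It is an illustration under simplifying assumptions, not RoHM's actual motion representation: the names `decompose` and `recompose`, the choice of joint 0 as the root, and the translation-only treatment (no root orientation, velocities, or body-model parameters) are all hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # one 3D position per joint; joint 0 is ASSUMED to be the root


def decompose(frames: List[Frame]) -> Tuple[List[Vec3], List[Frame]]:
    """Split global joint positions into a root trajectory and root-relative poses."""
    trajectory: List[Vec3] = []
    local_poses: List[Frame] = []
    for joints in frames:
        root = joints[0]
        trajectory.append(root)
        # Express every joint relative to the root of the same frame.
        local_poses.append(
            [(x - root[0], y - root[1], z - root[2]) for (x, y, z) in joints]
        )
    return trajectory, local_poses


def recompose(trajectory: List[Vec3], local_poses: List[Frame]) -> List[Frame]:
    """Inverse of decompose: add the root position back to every joint."""
    return [
        [(x + r[0], y + r[1], z + r[2]) for (x, y, z) in pose]
        for r, pose in zip(trajectory, local_poses)
    ]


# Two frames, two joints each: the body moves one unit along x between frames.
frames = [
    [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 1.0, 0.0), (1.0, 2.0, 0.5)],
]
trajectory, local_poses = decompose(frames)
```

In this toy setting the two parts can be denoised or infilled independently and then recombined with `recompose`, which is the intuition behind learning one model per solution space.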

Paper Structure

This paper contains 24 sections, 13 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our method robustly reconstructs smooth and complete 3D human motion from different inputs, such as incomplete and noisy motion estimates (left), RGB-D (middle) and RGB (right) monocular videos. We learn diffusion-based models to denoise and infill both root trajectory in global space and local motion in body-root space for visible and occluded joints, predicting whether feet are in contact or not with the ground for improved physical plausibility. Compared with baselines such as HuMoR (Rempe et al., 2021), our method reconstructs more plausible motions that faithfully match image evidence, especially under heavy occlusions.
  • Figure 2: Overview of our approach. Given an initial noisy motion sequence $\tilde{\boldsymbol{X}}=(\tilde{\boldsymbol{R}}, \tilde{\boldsymbol{P}})$ and the corresponding root/body joint occlusion masks $\boldsymbol{M}_{\boldsymbol{R}}$ and $\boldsymbol{M}_{\boldsymbol{P}}$, we employ two diffusion-based models, TrajNet and PoseNet, to estimate global root trajectory $\hat{\boldsymbol{R}}_0$ and local pose $\hat{\boldsymbol{P}}_0$, separately (Sec. \ref{sec:diffmotion}). We leverage an additional conditioning module, TrajControl, to fine-tune TrajNet and flexibly condition it on denoised local pose $\hat{\boldsymbol{P}}_0$, leading to improved trajectory reconstruction (Sec. \ref{sec:trajcontrol}). At inference time, TrajNet, PoseNet, and TrajControl are leveraged in an iterative inference scheme to refine local and global motion (Sec. \ref{sec:inference}).
  • Figure 3: Model performance with respect to different input noise levels for the Occ-L. setup on the AMASS test set.
  • Figure 4: Qualitative results on AMASS. Given noisy input with occluded lower body, we reconstruct more accurate and realistic motions (rows 1-2), with fewer foot-ground penetrations (row 3) than the baseline method.
  • Figure 5: Qualitative results on PROX (RGB-D input, left) and EgoBody (RGB input, right).
  • ...and 4 more figures
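
The control flow described in the Figure 2 caption can be sketched in Python as follows. This is a toy sketch of the orchestration only, not the authors' implementation: the function `iterative_inference`, the parameter `num_rounds`, the exact alternation order, and the choice to pass the current trajectory estimate to the pose model are assumptions made for illustration. The stand-in `infill_mean` is a hypothetical placeholder for a learned diffusion denoiser and works on 1-D toy sequences.

```python
from typing import Callable, List, Tuple

Seq = List[float]   # toy 1-D signal standing in for a motion sequence
Mask = List[bool]   # True where the input is visible, False where occluded

TrajModel = Callable[[Seq, Mask], Seq]        # (noisy_traj, mask) -> traj_hat
CondModel = Callable[[Seq, Mask, Seq], Seq]   # (noisy, mask, condition) -> estimate


def iterative_inference(
    noisy_traj: Seq, noisy_pose: Seq, mask_traj: Mask, mask_pose: Mask,
    traj_net: TrajModel, pose_net: CondModel, traj_control: CondModel,
    num_rounds: int = 2,
) -> Tuple[Seq, Seq]:
    """Alternate between trajectory and pose estimation (ASSUMED ordering)."""
    # Initial pass: trajectory first, then local pose.
    traj_hat = traj_net(noisy_traj, mask_traj)
    pose_hat = pose_net(noisy_pose, mask_pose, traj_hat)
    for _ in range(num_rounds):
        # Refine the trajectory, now conditioned on the denoised local pose.
        traj_hat = traj_control(noisy_traj, mask_traj, pose_hat)
        # Refine the local pose given the updated trajectory.
        pose_hat = pose_net(noisy_pose, mask_pose, traj_hat)
    return traj_hat, pose_hat


def infill_mean(values: Seq, mask: Mask) -> Seq:
    """Hypothetical stand-in denoiser: fill occluded entries with the visible mean."""
    visible = [v for v, m in zip(values, mask) if m]
    fallback = sum(visible) / len(visible) if visible else 0.0
    return [v if m else fallback for v, m in zip(values, mask)]


calls: List[str] = []  # records which stand-in model ran, in order


def toy_traj_net(traj: Seq, mask: Mask) -> Seq:
    calls.append("traj_net")
    return infill_mean(traj, mask)


def toy_pose_net(pose: Seq, mask: Mask, traj_hat: Seq) -> Seq:
    calls.append("pose_net")
    return infill_mean(pose, mask)  # toy model ignores its conditioning input


def toy_traj_control(traj: Seq, mask: Mask, pose_hat: Seq) -> Seq:
    calls.append("traj_control")
    return infill_mean(traj, mask)  # toy model ignores its conditioning input


traj_hat, pose_hat = iterative_inference(
    [0.0, 9.0, 2.0], [1.0, 1.0, 7.0],
    [True, False, True], [True, True, False],
    toy_traj_net, toy_pose_net, toy_traj_control, num_rounds=2,
)
```

Passing the models as callables keeps the loop independent of how each denoiser is implemented, so a real diffusion sampler could replace the toy stand-ins without changing the orchestration.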