DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer

TL;DR

DuoMo is a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. It factorizes motion learning into two diffusion models, one in camera space and one in world space, and reports state-of-the-art results on EMDB and RICH.

Abstract

We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
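The lifting step between the two models can be illustrated with a rigid camera-to-world transform. The sketch below is not the authors' implementation: the function names, the row-major pose layout, and the toy trajectory are assumptions chosen for illustration. It applies X_w = R X_c + t per frame and leaves frames without a camera-space prediction as `None`, which is the kind of incomplete proposal the world-space model is described as refining.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]  # row-major 3x3 rotation (assumed layout)


def lift_to_world(p_cam: Vec3, R_c2w: Mat3, t_c2w: Vec3) -> Vec3:
    """Apply a rigid camera-to-world transform: X_w = R @ X_c + t."""
    return tuple(
        sum(R_c2w[i][j] * p_cam[j] for j in range(3)) + t_c2w[i]
        for i in range(3)
    )


def lift_sequence(
    pelvis_cam: List[Optional[Vec3]],
    poses: List[Tuple[Mat3, Vec3]],
) -> List[Optional[Vec3]]:
    """Lift per-frame pelvis positions; frames with no prediction stay None."""
    out: List[Optional[Vec3]] = []
    for p, (R, t) in zip(pelvis_cam, poses):
        out.append(None if p is None else lift_to_world(p, R, t))
    return out


# Toy data (hypothetical): the camera slides 1 m along world x per frame
# without rotating, while the person stands still in the world.
I3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
poses = [(I3, (float(k), 0.0, 0.0)) for k in range(3)]

# Camera-space pelvis positions; the middle frame has no prediction.
pelvis_cam = [(0.0, 0.0, 3.0), None, (-2.0, 0.0, 3.0)]

world = lift_sequence(pelvis_cam, poses)
print(world)
```

In this toy case the two observed frames map to the same world position even though the camera-space positions differ, because the camera motion is accounted for. With noisy estimated camera poses the lifted points would disagree, which is where a second, world-space refinement stage becomes useful.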

Paper Structure

This paper contains 24 sections, 11 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Dual Motion Diffusion (DuoMo) recovers world-space human motion from unconstrained monocular videos. Our two-prior approach reconstructs accurate world-space motion (left), completes motion under occlusion or missing observation (middle), and generates scene-consistent motion with guided sampling (right). Moreover, our method outputs the mesh vertices directly without a SMPL model.
  • Figure 2: Method overview. (A) In the first stage, our camera-space model encodes video features and generates camera-space human motion. This motion is lifted to world coordinates using estimated camera poses and becomes the initial proposal for world-space human motion. Some predictions are missing because the subject is out of frame. In the second stage, the world-space model encodes the noisy world-space motion and generates globally consistent world-space motion. Plots at the bottom visualize the pelvis depth in world coordinates. (B) Camera-space model architecture. (C) World-space model architecture.
  • Figure 3: Height conditioning. Our camera-space model can generate predictions conditioned on an input body height. As shown in the bottom row, height affects the estimated distance from the camera and thus plays an important role in world-space accuracy.
  • Figure 4: Guided sampling on Egobody [egobody]. Our proposed guidance terms correct for drift and improve world-space trajectory accuracy.
  • Figure 5: Qualitative comparison on Egobody [ego_body]. All methods use the ground-truth camera poses. Results from GVHMR [shen2024gvhmr] are smooth but drift under shaky camera motion. PromptHMR [wang2025prompthmr] has better position accuracy but is not robust to occlusion and depth ambiguity. Our results show both accuracy and robustness.
  • ...and 7 more figures
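
The link between body height and distance from the camera noted in the Figure 3 caption can be seen from a simple pinhole approximation. This is a sketch and not the paper's conditioning mechanism: it assumes an upright, fronto-parallel person, and the function name and all numbers are hypothetical. Under that model the pixel height is roughly f · H / Z, so for a fixed observation the implied depth Z scales linearly with the assumed height H.

```python
def depth_from_height(focal_px: float, body_height_m: float, pixel_height: float) -> float:
    """Pinhole approximation: pixel_height ~= focal_px * body_height_m / depth."""
    return focal_px * body_height_m / pixel_height


# Same observation (a person spanning 500 px at a 1000 px focal length),
# interpreted under two different assumed body heights.
focal_px = 1000.0
pixel_height = 500.0

z_short = depth_from_height(focal_px, 1.6, pixel_height)  # about 3.2 m
z_tall = depth_from_height(focal_px, 1.8, pixel_height)   # about 3.6 m

print(f"assumed 1.6 m -> depth {z_short:.2f} m")
print(f"assumed 1.8 m -> depth {z_tall:.2f} m")
```

A 0.2 m change in assumed height moves the person by about 0.4 m in depth in this example, and that offset carries over directly into the world-space trajectory once the motion is lifted.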