Lifting Motion to the 3D World via 2D Diffusion
Jiaman Li, C. Karen Liu, Jiajun Wu
TL;DR
This work estimates global 3D motion from 2D pose sequences without any 3D supervision. It introduces MVLift, a four-stage framework that combines line-conditioned 2D diffusion priors, epipolar geometry, and multi-view optimization to progressively enforce cross-view consistency. The final stage is a diffusion model that outputs multi-view-consistent 2D sequences, from which 3D motion can be reconstructed efficiently. A synthetic data generation pipeline creates strictly consistent multi-view 2D sequences, so training requires no 3D ground truth. The approach generalizes to human, animal, and human-object interaction domains, and it outperforms 3D-supervised baselines on five datasets. MVLift thus offers a scalable, diffusion-based route to realistic, globally coherent 3D motion from 2D observations, with potential applications in graphics, robotics, and embodied AI.
Abstract
Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, which limits its applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that require 3D supervision.
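To illustrate why consistent 2D poses across multiple views are a key step toward 3D, here is a minimal sketch of linear (DLT) triangulation of a single joint from several calibrated views. It is not MVLift's reconstruction procedure, which recovers joint rotations and root trajectories; the pinhole camera model, the function names (`make_camera`, `project`, `triangulate_point`), and all numeric values are assumptions chosen for illustration.

```python
import numpy as np


def make_camera(yaw, distance=4.0, focal=1000.0, center=500.0):
    """Hypothetical pinhole camera: rotate the world about the y-axis by
    `yaw`, then push it `distance` units along the camera's z-axis."""
    K = np.array([[focal, 0.0, center],
                  [0.0, focal, center],
                  [0.0, 0.0, 1.0]])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    t = np.array([[0.0], [0.0], [distance]])
    return K @ np.hstack([R, t])  # 3x4 projection matrix


def project(P, X):
    """Project a 3D point X (shape (3,)) to pixel coordinates (shape (2,))."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one point from two or more views.

    Each view contributes two rows to a homogeneous system A X = 0;
    the solution is the right singular vector of A with the smallest
    singular value.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize


# Illustrative setup: four views placed around the subject (values assumed).
cameras = [make_camera(yaw) for yaw in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)]
joint_3d = np.array([0.3, -0.2, 0.5])
joint_2d = [project(P, joint_3d) for P in cameras]
recovered = triangulate_point(cameras, joint_2d)
```

When the 2D observations agree across views, the linear system is consistent and the joint position is recovered up to numerical precision. When they disagree, the least-squares solution no longer reprojects onto the observed 2D points in every view, which is one way to see why cross-view consistency matters before lifting to 3D.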
