Lifting Motion to the 3D World via 2D Diffusion

Jiaman Li; C. Karen Liu; Jiajun Wu

TL;DR

This work tackles the problem of estimating global 3D motion from 2D pose sequences without any 3D supervision. It introduces MVLift, a four-stage framework that leverages line-conditioned 2D diffusion priors, epipolar geometry, and multi-view optimization to progressively enforce cross-view consistency, culminating in a final diffusion model that outputs multi-view-consistent 2D sequences for efficient 3D reconstruction. A synthetic data generation pipeline creates strictly consistent multi-view 2D sequences, enabling training without 3D ground truth and generalizing to human, animal, and human-object interaction domains, where MVLift outperforms 3D-supervised baselines on five datasets. The approach provides a scalable, diffusion-based pathway to realistic, globally coherent 3D motion from 2D observations, with broad implications for graphics, robotics, and embodied AI.

Abstract

Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground-truth 3D motions, limiting its applicability to activities well represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that require 3D supervision.
