Lifting Motion to the 3D World via 2D Diffusion

Jiaman Li, C. Karen Liu, Jiajun Wu

TL;DR

This work tackles the problem of estimating global 3D motion from 2D pose sequences without any 3D supervision. It introduces MVLift, a four-stage framework that leverages line-conditioned 2D diffusion priors, epipolar geometry, and multi-view optimization to progressively enforce cross-view consistency, culminating in a final diffusion model that outputs multi-view-consistent 2D sequences for efficient 3D reconstruction. A synthetic data generation pipeline creates strictly consistent multi-view 2D sequences, enabling training without 3D ground truth and generalizing to human, animal, and human-object interaction domains, where MVLift outperforms 3D-supervised baselines on five datasets. The approach provides a scalable, diffusion-based pathway to realistic, globally coherent 3D motion from 2D observations, with broad implications for graphics, robotics, and embodied AI.

Abstract

Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, which limits its applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that require 3D supervision.

Paper Structure

This paper contains 34 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Our framework, MVLift, is trained only on 2D pose sequences and generates 3D motions, including joint rotations and root trajectories in the world coordinate system. The approach generalizes across human poses, human-object interactions, and animal poses.
  • Figure 2: Overview of our multi-stage framework. In Stage 1, we train a 2D motion diffusion model conditioned on simulated epipolar lines. Stage 2 utilizes this model to optimize multi-view 2D motion sequences, achieving only roughly consistent sequences. These sequences are used for 3D motion optimization in Stage 3, and the synthetic 3D data is then reprojected into strictly consistent multi-view 2D sequences. In Stage 4, we train a multi-view 2D motion diffusion model on these data to efficiently generate consistent 2D sequences across views.
  • Figure 3: Denoising network of the multi-view 2D motion diffusion model, using View 1 as an example for illustration.
  • Figure 4: Qualitative results on AIST++. On the left, we plot each component $(x, y, z)$ of the predicted root trajectory.
  • Figure 5: Results of human perceptual studies.
  • ...and 2 more figures
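
The Figure 2 caption mentions two geometric operations: conditioning on epipolar lines (Stage 1) and reprojecting 3D motion into strictly consistent multi-view 2D sequences (Stage 3). The sketch below illustrates both on a toy example and is not the authors' implementation. The pinhole intrinsics, the camera orbiting the subject about the vertical axis, the joint positions, and every function name are assumptions made for illustration only. It projects a few 3D joints into two views, builds the epipolar line in the second view for each joint seen in the first, and measures how far the second view's projection lies from that line.

```python
import math

# Hypothetical pinhole intrinsics (pixels) and camera-to-subject distance.
F, CX, CY = 1000.0, 512.0, 512.0
DIST = 5.0

def world_to_cam(p, yaw):
    """x_cam = R_y(yaw) @ x_world + (0, 0, DIST): a camera orbiting the origin."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z + DIST)

def cam_to_world(q, yaw):
    """Inverse of world_to_cam: undo the translation, then rotate back."""
    x, y, z = q[0], q[1], q[2] - DIST
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * z, y, s * x + c * z)

def project(p, yaw):
    """Perspective projection of a world point into the view at angle `yaw`."""
    x, y, z = world_to_cam(p, yaw)
    return (F * x / z + CX, F * y / z + CY)

def backproject(uv, depth, yaw):
    """World point at the given camera-space depth along the ray through pixel uv."""
    q = ((uv[0] - CX) / F * depth, (uv[1] - CY) / F * depth, depth)
    return cam_to_world(q, yaw)

def epipolar_line(uv, yaw_src, yaw_dst):
    """Line (a, b, c) in the destination view, with a*u + b*v + c = 0 and
    a^2 + b^2 = 1, obtained by projecting two points of the source ray."""
    p0 = project(backproject(uv, 0.5 * DIST, yaw_src), yaw_dst)
    p1 = project(backproject(uv, 1.5 * DIST, yaw_src), yaw_dst)
    # Cross product of the two homogeneous image points gives the line.
    a = p0[1] - p1[1]
    b = p1[0] - p0[0]
    c = p0[0] * p1[1] - p0[1] * p1[0]
    n = math.hypot(a, b)
    return (a / n, b / n, c / n)

def line_distance(line, uv):
    """Distance in pixels from pixel uv to a normalized line."""
    return abs(line[0] * uv[0] + line[1] * uv[1] + line[2])

# Toy "skeleton": three 3D joints near the origin (made-up values).
joints = [(0.1, -0.8, 0.05), (0.3, 0.2, -0.2), (-0.25, 0.6, 0.15)]
yaw_a, yaw_b = 0.0, math.pi / 2

residuals = []
for joint in joints:
    uv_a = project(joint, yaw_a)  # 2D pose in view A
    uv_b = project(joint, yaw_b)  # 2D pose in view B
    residuals.append(line_distance(epipolar_line(uv_a, yaw_a, yaw_b), uv_b))

print("max epipolar residual (pixels):", max(residuals))
```

Because both 2D poses here come from the same 3D joints, the residual is zero up to floating-point error, which is what "strictly consistent" means in this toy setting. Independently generated views would generally leave a nonzero residual, and that residual is the kind of cross-view inconsistency the multi-stage pipeline is described as reducing.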