Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation

Huaijin Pi, Ruoxi Guo, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou

TL;DR

This work tackles the scarcity and cost of 3D motion capture data by leveraging abundant 2D video motion for text-driven 3D motion generation. It introduces a two-stage approach: first learn a 2D local-motion prior from text–motion pairs, then fine-tune it into a multi-view model that predicts view-consistent local motion and root dynamics. 3D motion is recovered by triangulating the multi-view predictions and accumulating root velocity. On HumanML3D, the method improves FID while remaining competitive on other metrics, and it covers a broader range of motions, especially under novel text prompts. The approach shows that large-scale 2D motion data can effectively augment 3D motion synthesis, offering a cost-efficient path to diverse and realistic human motions for animation and interactive applications.

Abstract

Text-driven human motion synthesis is capturing significant attention for its ability to effortlessly generate intricate movements from abstract text cues, showcasing its potential for revolutionizing motion design not only in film narratives but also in virtual reality experiences and computer game development. Existing methods often rely on 3D motion capture data, which require special setups resulting in higher costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore leveraging 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-motion pairs. To enhance this model to synthesize 3D motion, we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Experiments on the HumanML3D dataset and novel text prompts demonstrate that our method efficiently utilizes 2D data, supporting realistic 3D human motion generation and broadening the range of motion types it supports. Our code will be made publicly available at https://zju3dv.github.io/Motion-2-to-3/.

Paper Structure

This paper contains 17 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of our key idea. (a) Our approach leverages 2D motion data to improve 3D motion generation by unifying 2D and 3D motion data. (b) Our framework yields better FID and generates a broader range of motion types.
  • Figure 2: Challenge of 2D motion from the real world. In real-world videos [23iccv_emdb], both the camera and the humans move in 3D space, resulting in 2D motion that combines both movements.
  • Figure 3: Our Pipeline. We design a Multi-view Diffusion model (a) to generate multi-view results (for simplicity, camera embedding is omitted in the figure). During inference, the Multi-view Diffusion model predicts 2D local motion and root velocity (b). Then, we use triangulation [97_triangulation] to recover 3D local joint positions (c) and accumulate root velocity to obtain the 3D global trajectory (d), resulting in the final 3D motion (e).
  • Figure 4: Qualitative comparison. The first two rows show motions outside the HumanML3D [22cvpr_humanml3d] dataset. Baseline methods produce incorrect types of motion, while ours are more consistent with the text descriptions. The last row shows that our approach generates a standing motion that matches the text description, whereas baseline methods fail to do so. Unnatural poses are highlighted in red boxes. Semantic misalignment is highlighted in dashed boxes.
  • Figure 5: Qualitative results of different strategies using 2D data. Baseline methods [24cvpr_mas, 23iccv_motionbert] fail to generate motion with global movement. The variant using a 2D condition as input [23iccv_zero1to3] may generate incorrect root movements, leading to floating motion. Unnatural poses are highlighted in red boxes.
  • ...and 1 more figure
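
The inference steps described for Figure 3 (triangulate multi-view 2D local motion into 3D local joint positions, accumulate root velocity into a global trajectory, and combine the two) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes orthographic cameras that are rotated about the vertical axis by known yaw angles, root velocity expressed as a per-frame 3D displacement, and root-relative local joints. The function names `project`, `triangulate`, `accumulate_root`, and `compose_motion` are hypothetical, and the camera sign convention is chosen only for illustration.

```python
import math


def project(point, yaw):
    """Orthographic projection into a camera rotated by `yaw` about the y axis.

    Assumed convention: the image u axis is (cos yaw, 0, sin yaw), v is y.
    """
    x, y, z = point
    return (x * math.cos(yaw) + z * math.sin(yaw), y)


def triangulate(observations, yaws):
    """Least-squares 3D point from per-view (u, v) observations of one joint."""
    if len(observations) != len(yaws) or len(yaws) < 2:
        raise ValueError("need one yaw per observation and at least two views")
    a = b = d = e = f = 0.0
    for (u, _), yaw in zip(observations, yaws):
        c, s = math.cos(yaw), math.sin(yaw)
        # Accumulate the 2x2 normal equations for u = x*c + z*s.
        a += c * c
        b += c * s
        d += s * s
        e += c * u
        f += s * u
    det = a * d - b * b
    if abs(det) < 1e-9:
        raise ValueError("degenerate views: yaws must not all coincide")
    x = (d * e - b * f) / det
    z = (a * f - b * e) / det
    # Every view sees the same height under this camera model, so average v.
    y = sum(v for _, v in observations) / len(observations)
    return (x, y, z)


def accumulate_root(velocities, start=(0.0, 0.0, 0.0)):
    """Integrate per-frame root displacements into a global trajectory."""
    trajectory = [tuple(start)]
    for vx, vy, vz in velocities:
        px, py, pz = trajectory[-1]
        trajectory.append((px + vx, py + vy, pz + vz))
    return trajectory


def compose_motion(local_joints, trajectory):
    """Add each frame's root position to that frame's root-relative joints."""
    return [
        [(jx + rx, jy + ry, jz + rz) for jx, jy, jz in frame]
        for frame, (rx, ry, rz) in zip(local_joints, trajectory)
    ]
```

In this sketch, each joint in each frame would be passed through `triangulate` using its 2D positions from all generated views, and the resulting root-relative joints would then be shifted by the trajectory from `accumulate_root`. Under a perspective camera model, the triangulation step would instead need a general linear solve over the projection matrices.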