Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space

Yuan Wang, Zhao Wang, Junhao Gong, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shixiang Tang, Dan Xu

TL;DR

This work introduces Holistic-Motion2D, a million-scale 2D holistic motion dataset with text annotations, to advance text-driven whole-body motion generation in 2D space. It proposes Tender, a baseline model that combines a Part-Aware VAE with a Confidence-Aware Generation framework and diffusion-based synthesis conditioned on CLIP text features, complemented by MoLIP for semantic retrieval-based evaluation. The paper demonstrates that 2D motions can serve as scalable priors for diverse, expressive movements, and it reports strong performance gains over 3D-focused baselines along with promising downstream applications such as pose-guided video generation and 3D motion lifting. Overall, this work establishes a practical, scalable path toward general 2D motion synthesis and offers a foundation for future 3D lifting and multi-domain human motion research, while acknowledging limitations such as single-person motions and licensing considerations.

Abstract

In this paper, we introduce a novel path to $\textit{general}$ human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the scope of available 3D motion data in terms of both size and diversity. To address these limitations, we exploit the extensive availability of 2D motion data. We present $\textbf{Holistic-Motion2D}$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring innovative $\textit{whole-body part-aware attention}$ and $\textit{confidence-aware modeling}$ techniques, tailored for 2D $\underline{\text T}$ext-driv$\underline{\text{EN}}$ whole-bo$\underline{\text D}$y motion gen$\underline{\text{ER}}$ation, namely $\textbf{Tender}$. Extensive experiments demonstrate the effectiveness of $\textbf{Holistic-Motion2D}$ and $\textbf{Tender}$ in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion. Project page: https://holistic-motion2d.github.io.

Paper Structure

This paper contains 45 sections, 7 equations, 11 figures, 17 tables.

Figures (11)

  • Figure 1: Overview of Holistic-Motion2D and generated 2D whole-body motions. Left: 2D human motion data in our dataset with (a) whole-body, (b) face, and (c) hand motions. Right: the generated 2D whole-body motion from our model. Every 2D motion sequence is shown following a temporal progression from left to right.
  • Figure 2: Comparison of estimating 3D keypoints (SPIN [kolotouros2019learning]) and directly predicting 2D keypoints (RTMPose [jiang2023rtmpose]) from images. The 2D keypoints remain precise under variations in viewpoint.
  • Figure 3: Overview of the keypoints and pose descriptions annotation pipeline of 2D whole-body motions.
  • Figure 4: Overview of our Tender framework. (a) The PA-VAE embeds whole-body part-aware spatio-temporal features into a latent space. (b) The diffusion model generates realistic whole-body motions conditioned on text. (c) Whole-body Part-Aware Attention models the spatial relations among different body parts with the Confidence-Aware Generation (CAG) mechanism.
  • Figure 5: Qualitative comparison of Tender with previous state-of-the-art methods. Tender generates more vivid human motions while preserving fidelity and achieving superior temporal consistency.
  • ...and 6 more figures