Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space
Yuan Wang, Zhao Wang, Junhao Gong, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shixiang Tang, Dan Xu
TL;DR
This work introduces Holistic-Motion2D, a million-scale 2D holistic motion dataset with text annotations, to advance text-driven whole-body motion generation in 2D space. It proposes Tender, a baseline model that combines a Part-Aware VAE with a Confidence-Aware Generation framework and diffusion-based synthesis conditioned on CLIP text, complemented by MoLIP for semantic retrieval-based evaluation. The paper demonstrates that 2D motions can serve as scalable priors for diverse, expressive movements and shows strong performance gains over 3D-focused baselines, plus promising downstream applications such as pose-guided video generation and 3D motion lifting. Overall, this work establishes a practical, scalable path toward general 2D motion synthesis and offers a foundation for future 3D lifting and multi-domain human motion research, while acknowledging limitations such as single-person motions and licensing considerations.
Abstract
In this paper, we introduce a novel path to $\textit{general}$ human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the size and diversity of available 3D motion data. To address these limitations, we exploit the extensive availability of 2D motion data. We present $\textbf{Holistic-Motion2D}$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring innovative $\textit{whole-body part-aware attention}$ and $\textit{confidence-aware modeling}$ techniques, tailored for 2D $\underline{\text T}$ext-driv$\underline{\text{EN}}$ whole-bo$\underline{\text D}$y motion gen$\underline{\text{ER}}$ation, namely $\textbf{Tender}$. Extensive experiments demonstrate the effectiveness of $\textbf{Holistic-Motion2D}$ and $\textbf{Tender}$ in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion. The project page is available at https://holistic-motion2d.github.io.
