HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang
TL;DR
HumanDreamer addresses the challenge of text-driven human-motion video generation by decoupling the task into Text-to-Pose and Pose-to-Video stages. It introduces MotionVid, a large-scale dataset of text–pose pairs, and MotionDiT, a diffusion-transformer model with a Local+Global attention design, augmented by the LAMA loss via CLoP to improve pose fidelity and text alignment. The Pose-to-Video module builds on a CogVideoX-inspired backbone with controllable conditioning to render videos from pose sequences, achieving state-of-the-art metrics on Text-to-Pose and competitive results on Pose-to-Video and Text-to-Video tasks. Overall, the approach demonstrates strong generation quality, versatile downstream utility (e.g., pose sequence prediction, 2D-to-3D lifting), and scalability through the MotionVid data pipeline, offering a flexible, text-driven path to realistic human-motion videos.
Abstract
Human-motion video generation has been a challenging task, primarily because human body movements are difficult to learn. Some approaches drive human-centric video generation explicitly through pose control, but they typically rely on poses derived from existing videos and therefore lack flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then uses these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, MotionDiT and the LAMA loss improve FID by 62.4% and R-precision at top-1, top-2, and top-3 by 41.8%, 26.3%, and 18.3%, respectively, advancing both the fidelity and the text-control accuracy of Text-to-Pose generation. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-to-3D motion lifting.
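For readers unfamiliar with the R-precision figures quoted in the abstract, the following is a minimal sketch of how top-k R-precision is commonly computed in text-to-motion retrieval. It assumes that row i of the text embeddings and row i of the pose embeddings form a matching pair and that Euclidean distance is used for ranking. The function name `r_precision`, the embedding inputs, and the candidate pool (here, every pose in the batch) are illustrative assumptions; the abstract does not specify the exact evaluation protocol used by HumanDreamer.

```python
import numpy as np


def r_precision(text_emb, pose_emb, top_k=(1, 2, 3)):
    """Top-k retrieval accuracy between paired text and pose embeddings.

    Assumption (not from the paper): row i of text_emb matches row i of
    pose_emb, and candidates are ranked by Euclidean distance.
    """
    text_emb = np.asarray(text_emb, dtype=float)
    pose_emb = np.asarray(pose_emb, dtype=float)

    # Pairwise Euclidean distances, shape (N, N): dist[i, j] = ||text_i - pose_j||
    diff = text_emb[:, None, :] - pose_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Rank of the ground-truth pose for each text: the number of
    # candidates that are strictly closer than the true match.
    gt_dist = np.diag(dist)
    rank = (dist < gt_dist[:, None]).sum(axis=1)

    # A text counts as a hit at k if its true pose is among the k nearest.
    return {k: float((rank < k).mean()) for k in top_k}
```

Calling `r_precision(text_emb, pose_emb)` returns a dictionary mapping each k to the fraction of texts whose matching pose appears among the k nearest candidates. Higher is better, which is why the abstract reports gains at top-1, top-2, and top-3 separately.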
