HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang

TL;DR

HumanDreamer addresses the challenge of text-driven human-motion video generation by decoupling the task into Text-to-Pose and Pose-to-Video stages. It introduces MotionVid, a large-scale dataset of text–pose pairs, and MotionDiT, a diffusion-transformer model with a Local+Global attention design, augmented by the LAMA loss via CLoP to improve pose fidelity and text alignment. The Pose-to-Video module builds on a CogVideoX-inspired backbone with controllable conditioning to render videos from pose sequences, achieving state-of-the-art metrics on Text-to-Pose and competitive results on Pose-to-Video and Text-to-Video tasks. Overall, the approach demonstrates strong generation quality, versatile downstream utility (e.g., pose sequence prediction, 2D-to-3D lifting), and scalability through the MotionVid data pipeline, offering a flexible, text-driven path to realistic human-motion videos.

Abstract

Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contributions improve FID by 62.4% and improve top-1, top-2, and top-3 R-precision by 41.8%, 26.3%, and 18.3%, respectively, advancing both the fidelity and the text-control accuracy of Text-to-Pose generation. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-to-3D motion lifting.
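For readers unfamiliar with the R-precision figures quoted in the abstract, the following is a minimal sketch of a retrieval-style top-k R-precision. It assumes that text and pose sequences have already been embedded into a shared space and that row i of each array forms a matched pair. The function name `r_precision`, the use of Euclidean distance, and ranking against every caption in the array are illustrative assumptions; the evaluation protocol actually used in the paper (embedding model, candidate pool size) is not described in this section.

```python
import numpy as np

def r_precision(text_emb, pose_emb, top_k=3):
    """Return [top-1, ..., top-k] retrieval accuracy (hypothetical helper).

    Assumption: row i of text_emb is the caption paired with row i of pose_emb.
    """
    # Pairwise Euclidean distances: dists[i, j] = ||pose_i - text_j||
    diff = pose_emb[:, None, :] - text_emb[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Rank of the true caption = number of captions strictly closer than it
    true_dist = np.diag(dists)[:, None]
    ranks = (dists < true_dist).sum(axis=1)
    # A pose counts as a hit at k if its true caption is among the k nearest
    return [float((ranks < k).mean()) for k in range(1, top_k + 1)]
```

A relative gain such as the reported 41.8% for top-1 would then be computed as `(new - old) / old` on these accuracies, again assuming the percentages in the abstract are relative improvements over a baseline.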

Paper Structure

This paper contains 28 sections, 17 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Illustration of HumanDreamer. The human-motion video generation is decoupled into two steps: Text-to-Pose generation and Pose-to-Video generation. The decoupled process integrates the flexibility of text control and the controllability of pose guidance.
  • Figure 2: The data cleaning and annotation pipeline for MotionVid begins with raw data sourced from public datasets and the internet, which is then segmented into video clips. To ensure high-quality data, we apply a video quality filter, data annotation, a human quality filter, and a caption quality filter.
  • Figure 3: Training pipeline of the proposed Text-to-Pose generation. Pose data are encoded into latent space by the Pose VAE and then processed by the proposed MotionDiT, where local feature aggregation and global attention are used to capture information from the entire pose sequence. Finally, the LAMA loss is calculated via the proposed CLoP, which enhances the training of MotionDiT.
  • Figure 4: Visualization results compared to SOTA Text-to-Pose methods. The results demonstrate that our model significantly outperforms other models. Our method generates poses that are more consistent with the text constraints, with keypoints maintaining their integrity and minimal motion jitter. For a better visual comparison, please refer to the supplementary materials.
  • Figure 5: Visualization results compared to SOTA Text-to-Video methods. Mochi1 [genmo2024mochi] and CogVideoX [yang2024cogvideox] exhibit issues such as body distortion, weak motion continuity, and neglecting facial generation. In contrast, HumanDreamer is able to generate more coherent and consistent videos with smoother transitions and better attention to details such as facial expressions. For a better visual comparison, please refer to the supplementary materials.
  • ...and 5 more figures