DreaMoving: A Human Video Generation Framework based on Diffusion Models

Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie

TL;DR

DreaMoving addresses the challenge of controllable, identity-preserving human video generation by introducing a diffusion-based framework with a Video ControlNet for motion conditioning and a Content Guider for appearance grounding. It combines motion blocks, long-frame pretraining, and a multi-stage training pipeline (Content Guider training, long-frame pretraining, Video ControlNet training, and expression fine-tuning) to enable pose/depth conditioning and image-based identity guidance. The approach supports text-only, image-only, and mixed prompts, achieving high-quality, temporally consistent videos and generalizing to unseen styles. This work offers a practical pathway for customizable, identity-consistent human video synthesis at scale.

Abstract

In this paper, we present DreaMoving, a diffusion-based controllable video generation framework that produces high-quality customized human videos. Specifically, given a target identity and a posture sequence, DreaMoving can generate a video of the target identity moving or dancing anywhere, driven by the posture sequence. To this end, we propose a Video ControlNet for motion control and a Content Guider for identity preservation. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results. The project page is available at https://dreamoving.github.io/dreamoving

Paper Structure

This paper contains 12 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: Overview of DreaMoving. The Video ControlNet is the image ControlNet [zhang2023adding] with motion blocks injected after each U-Net block. It converts the control sequence (pose or depth) into additional temporal residuals. The Denoising U-Net is derived from the Stable Diffusion [rombach2021highresolution] U-Net, with motion blocks added for video generation. The Content Guider converts the input text prompts and appearance references, such as the human face (clothing is optional), into content embeddings for cross-attention.
  • Figure 2: Results of DreaMoving with a text prompt as input.
  • Figure 3: Results of DreaMoving with a text prompt and a face image as inputs.
  • Figure 4: Results of DreaMoving with face and clothing images as inputs.
  • Figure 5: Results of DreaMoving with a stylized image as input.
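
The data flow described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: every function name here is hypothetical, there are no learned weights, each frame is reduced to a single feature vector, the "motion block" is stood in for by simple neighbour averaging, and the Content Guider is assumed to concatenate text and image embeddings (the overview above does not say how they are combined). It only shows where the temporal residuals and content embeddings would enter the denoising U-Net.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    """Element-wise sum of two [frames][dim] feature lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def cross_attention(queries, context):
    """Projection-free attention: each query attends over the context tokens."""
    dim = len(context[0])
    out = []
    for q in queries:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)])
    return out

def motion_block(frames):
    """Toy temporal mixing (assumption): average each frame with its neighbours."""
    mixed = []
    for t in range(len(frames)):
        window = frames[max(0, t - 1): t + 2]
        mixed.append([sum(col) / len(window) for col in zip(*window)])
    return mixed

def content_guider(text_emb=None, face_emb=None, cloth_emb=None):
    """Build content embeddings; concatenation is an assumed combination rule."""
    content = list(text_emb or []) + list(face_emb or []) + list(cloth_emb or [])
    if not content:
        raise ValueError("at least one of text, face, or cloth embeddings is required")
    return content

def video_controlnet(control_seq, num_blocks):
    """Turn a pose/depth feature sequence into one temporal residual per U-Net block."""
    residuals, h = [], control_seq
    for _ in range(num_blocks):
        h = motion_block(h)
        residuals.append(h)
    return residuals

def denoising_unet(latents, residuals, content):
    """Per block: cross-attend to content, mix over time, add the control residual."""
    h = latents
    for res in residuals:
        h = add(h, cross_attention(h, content))
        h = motion_block(h)
        h = add(h, res)
    return h

# Toy inputs: 4 frames, 3 feature channels, text prompt plus face image.
frames, dim = 4, 3
pose_seq = [[0.1 * (t + d) for d in range(dim)] for t in range(frames)]
latents = [[0.0] * dim for _ in range(frames)]
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
face_emb = [[0.0, 0.0, 1.0]]

content = content_guider(text_emb=text_emb, face_emb=face_emb)
residuals = video_controlnet(pose_seq, num_blocks=2)
video = denoising_unet(latents, residuals, content)
print(len(video), len(video[0]))
```

Calling `content_guider` with only `text_emb`, only `face_emb`, or both mirrors the text-only, image-only, and mixed prompt modes mentioned in the TL;DR; the rest of the sketch is unchanged across those modes.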