Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons

Zhe Xu, Kun Wei, Xu Yang, Cheng Deng

TL;DR

The paper tackles compositional human dance generation (cHDG) in multi-person, real-background scenarios by introducing a zero-shot framework, MultiDance-Zero, that combines pose-aware inversion, compositional augmentation, and consistency-guided sampling. It leverages latent diffusion models with ControlNet conditioning to reconstruct composed reference images and learn generalizable text embeddings that adapt to unseen poses. Through a dedicated dataset and evaluation protocol, the authors demonstrate that existing HDG methods struggle to generalize, while their approach achieves superior temporal consistency and pose accuracy. The work enables realistic, multi-person dance synthesis without subject-specific training, with implications for entertainment and education while highlighting current diffusion-model limitations.

Abstract

Human dance generation (HDG) aims to synthesize realistic videos from images and sequences of driving poses. Despite great success, existing methods are limited to generating videos of a single person with specific backgrounds, while the generalizability for real-world scenarios with multiple persons and complex backgrounds remains unclear. To systematically measure the generalizability of HDG models, we introduce a new task, dataset, and evaluation protocol of compositional human dance generation (cHDG). Evaluating the state-of-the-art methods on cHDG, we empirically find that they fail to generalize to real-world scenarios. To tackle the issue, we propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses. Specifically, in contrast to straightforward DDIM or null-text inversion, we first present a pose-aware inversion method to obtain the noisy latent code and initialization text embeddings, which can accurately reconstruct the composed reference image. Since directly generating videos from them will lead to severe appearance inconsistency, we propose a compositional augmentation strategy to generate augmented images and utilize them to optimize a set of generalizable text embeddings. In addition, consistency-guided sampling is elaborated to encourage the background and keypoints of the estimated clean image at each reverse step to be close to those of the reference image, further improving the temporal consistency of generated videos. Extensive qualitative and quantitative results demonstrate the effectiveness and superiority of our approach.
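The consistency-guided sampling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the standard DDIM relation for the estimated clean sample, $\tilde{x}_0 = (z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)/\sqrt{\bar{\alpha}_t}$, a hypothetical binary background mask, and a plain squared-error background loss. The keypoint term is omitted, and the noise prediction is treated as a constant when taking the gradient, whereas a real system would typically backpropagate through the network. All function and parameter names are illustrative.

```python
import math

def estimate_clean(z_t, eps_pred, alpha_bar_t):
    """Standard DDIM estimate of the clean sample from a noisy sample."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [(z - s * e) / a for z, e in zip(z_t, eps_pred)]

def background_loss(z_t, eps_pred, alpha_bar_t, z_ref, bg_mask):
    """0.5 * masked squared error between the clean estimate and the reference."""
    x0_hat = estimate_clean(z_t, eps_pred, alpha_bar_t)
    return 0.5 * sum(m * (x - r) ** 2 for m, x, r in zip(bg_mask, x0_hat, z_ref))

def background_guidance_step(z_t, eps_pred, alpha_bar_t, z_ref, bg_mask, step_size=0.1):
    """One gradient step on z_t that pulls the background of the clean
    estimate toward the reference (assumption: eps_pred held constant)."""
    x0_hat = estimate_clean(z_t, eps_pred, alpha_bar_t)
    a = math.sqrt(alpha_bar_t)
    # d(loss)/d(z_t) = mask * (x0_hat - z_ref) / sqrt(alpha_bar_t)
    return [z - step_size * m * (x - r) / a
            for z, m, x, r in zip(z_t, bg_mask, x0_hat, z_ref)]

# Toy usage: four "pixels", the first two marked as background.
z_ref = [0.5, -1.0, 2.0, 0.0]
eps_true = [0.1, -0.2, 0.3, 0.4]
alpha_bar = 0.5
z_t = [math.sqrt(alpha_bar) * r + math.sqrt(1 - alpha_bar) * e
       for r, e in zip(z_ref, eps_true)]
eps_bad = [e + 0.3 for e in eps_true]   # imperfect noise prediction
bg_mask = [1.0, 1.0, 0.0, 0.0]
z_guided = background_guidance_step(z_t, eps_bad, alpha_bar, z_ref, bg_mask)
```

In this toy setup the guided latent has a lower background loss than the unguided one, while entries outside the mask are left untouched. That is the intended effect of steering each reverse step toward the reference background.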

Paper Structure (16 sections, 13 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: (top) Given reference images and driving poses, (A) DreamPose [karras2023dreampose], a state-of-the-art HDG model, fails to generalize to real-world scenarios with complex backgrounds and diverse poses. (B) Directly combining null-text inversion [mokady2023null] and ControlNet [zhang2023adding] leads to severe temporal inconsistency. (C) Our approach synthesizes a video that both retains an appearance consistent with the reference images and precisely follows the driving poses.
  • Figure 2: Method overview. Given reference images of multiple persons and a background: (A) We propose pose-aware inversion to obtain the noisy latent code $z_T^r$ and initialization text embeddings from a composed reference image $x_0^r$ using pretrained diffusion models (DMs). Moreover, a compositional augmentation strategy is introduced to generate augmented images that share the same poses and appearances as $x_0^r$ but at different spatial locations. (B) We use the augmented images to optimize a set of generalizable text embeddings $\{ \oslash_t, c_t\}_{t=0}^T$ by jointly minimizing a reference term $\mathcal{L}_{ref}$ and a generalization term $\mathcal{L}_{gen}$. (C) During inference, consistency-guided sampling encourages the background and keypoints of the estimated clean image $\tilde{x}_0^g$ to be consistent with those of the reference image $x_0^r$, which further improves the temporal consistency of the generated videos. Red and yellow circles denote the user-provided location and scale of the corresponding persons.
  • Figure 3: Comparisons of reconstruction results. We visually show the reconstruction results of the original null-text inversion and our pose-aware inversion. The pose inputs utilized in the reverse process are attached to the top left.
  • Figure 4: Single-person results. We compare our method with state-of-the-art baselines on cHDG with a single person. The top row shows the composed reference image and driving poses.
  • Figure 5: Two-person results. We compare our method with state-of-the-art baselines on cHDG with two persons.
  • ...and 2 more figures