Dance Any Beat: Blending Beats with Visuals in Dance Video Generation

Xuanchen Wang; Heng Wang; Dongnan Liu; Weidong Cai

TL;DR

DabFusion tackles music-guided dance video generation for arbitrary individuals from a single reference image, removing the need for keypoint annotations. It uses a two-stage framework: a latent-flow auto-encoder that models motion in latent space, and a diffusion-based latent-flow generator conditioned on a music representation $e=[e_c,e_w,e_b]$, where $e_c$ comes from a fine-tuned CLAP, $e_w$ from a fine-tuned Wav2CLIP, and $e_b$ carries beat cues extracted with Librosa. The model is trained on the AIST++ dataset and evaluated on video quality, audio-video synchronization, and motion-music alignment; for the last of these, the paper introduces the 2D Motion-Music Alignment Score (2D-MM Align) to better capture rhythmic alignment in 2D video. DabFusion also demonstrates Choreograph Anyone, which lets unseen individuals dance from fused images, and an ablation study shows that beat information and sequence length materially affect the alignment metrics. Overall, the approach establishes a solid baseline for personalized, music-driven dance video synthesis, with potential applications in choreography, AR/VR, and entertainment.
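The music representation $e=[e_c,e_w,e_b]$ can be pictured as three feature vectors joined end to end. The sketch below is only an illustration under assumptions: the helper name `build_music_embedding`, the vector sizes, and the random stand-ins for the CLAP, Wav2CLIP, and Librosa outputs are all hypothetical, and this summary does not state how the paper actually fuses the three parts.

```python
import numpy as np

def build_music_embedding(e_c, e_w, e_b):
    """Join three music feature vectors into one conditioning vector e = [e_c, e_w, e_b].

    Plain concatenation is an assumption made for illustration only.
    """
    parts = [np.asarray(p, dtype=np.float32).ravel() for p in (e_c, e_w, e_b)]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
e_c = rng.standard_normal(512)   # stand-in for a CLAP embedding (size assumed)
e_w = rng.standard_normal(512)   # stand-in for a Wav2CLIP embedding (size assumed)
e_b = np.zeros(40)               # stand-in for per-frame beat indicators (length assumed)
e_b[::10] = 1.0                  # mark a beat on every 10th frame

e = build_music_embedding(e_c, e_w, e_b)
print(e.shape)
```

Keeping the three parts as separate inputs until the last step makes it easy to drop one of them, for example the beat cues, when reproducing an ablation.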

Abstract

Generating dance from music is crucial for advancing automated choreography. Current methods typically produce skeleton keypoint sequences instead of dance videos and lack the capability to make specific individuals dance, which reduces their real-world applicability. These methods also require precise keypoint annotations, complicating data collection and limiting the use of self-collected video datasets. To overcome these challenges, we introduce a novel task: generating dance videos directly from images of individuals guided by music. This task enables the dance generation of specific individuals without requiring keypoint annotations, making it more versatile and applicable to various situations. Our solution, the Dance Any Beat Diffusion model (DabFusion), utilizes a reference image and a music piece to generate dance videos featuring various dance types and choreographies. The music is analyzed by our specially designed music encoder, which identifies essential features including dance style, movement, and rhythm. DabFusion excels in generating dance videos not only for individuals in the training dataset but also for any previously unseen person. This versatility stems from its approach of generating latent optical flow, which contains all necessary motion information to animate any person in the image. We evaluate DabFusion's performance using the AIST++ dataset, focusing on video quality, audio-video synchronization, and motion-music alignment. We propose a 2D Motion-Music Alignment Score (2D-MM Align), which builds on the Beat Alignment Score to more effectively evaluate motion-music alignment for this new task. Experiments show that our DabFusion establishes a solid baseline for this innovative task. Video results can be found on our project page: https://DabFusion.github.io.
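The abstract says that 2D-MM Align builds on the Beat Alignment Score but does not define either here. As a rough orientation, the sketch below shows one common form of a beat alignment score: each motion beat is matched to its nearest music beat, and the distance is passed through a Gaussian. The function name, the default `sigma`, and the example beat positions are assumptions for illustration; this is not the paper's 2D-MM Align.

```python
import math

def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """Average closeness of each motion beat to its nearest music beat.

    Beats are given as frame indices (or times). The score lies in (0, 1],
    with 1.0 meaning every motion beat falls exactly on a music beat.
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for t_motion in motion_beats:
        # Distance from this motion beat to the closest music beat.
        nearest = min(abs(t_motion - t_music) for t_music in music_beats)
        total += math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
    return total / len(motion_beats)

music = [0, 15, 30, 45]                                   # hypothetical music beats
perfect = beat_alignment_score([0, 15, 30, 45], music)    # motion beats on the beat
shifted = beat_alignment_score([3, 18, 33, 48], music)    # every motion beat 3 frames late
print(perfect, shifted)
```

In this toy case the shifted sequence scores lower than the perfectly aligned one, which is the behaviour any alignment metric of this kind is meant to capture.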

Paper Structure (15 sections, 7 equations, 7 figures, 6 tables)

Introduction
Related Works
Method
Overview
Music Encoding
Latent Flow Estimation
Latent Flow Generation
Experiments and Results
Dataset and Implementation Details
Evaluation Metrics
Result Analysis
Choreograph Anyone
Ablation Study
Discussion
Conclusion

Figures (7)

Figure 1: We introduce DabFusion, a diffusion-based framework designed to generate videos of individuals dancing, utilizing a music input and an initial reference image as the conditions for video generation.
Figure 2: Exemplar videos generated by DabFusion. Taking the first image as the starting frame and a music clip as guidance for the dance style, our framework can generate dance videos in varied styles, featuring different dancers from multiple perspectives with diverse initial poses and positions.
Figure 3: Overview of DabFusion. Given a reference image $x_{0}$ with dimensions $H_{x} \times W_{x} \times 3$ and a piece of music $m$, DabFusion takes a noise input together with the image embedding $z_{0}$, which has dimensions $H_{z} \times W_{z} \times C_{z}$, and the music embedding $e$ as conditions. After the denoising stage of the diffusion model, we obtain $h_{1}^{N}$, which has dimensions $H_{z} \times W_{z} \times 3 \times N$ and comprises a concatenated sequence of latent flows and their corresponding occlusion maps. $h_{1}^{N}$ is used to transform $z_{0}$ into a new sequence of latent maps, denoted $\tilde{z}_{1}^{N}$, which is then decoded to produce an image sequence.
Figure 4: Training of the latent flow auto-encoder. The flow predictor learns to estimate the latent flow $f$ and occlusion map $m$ between the reference frame $x_{ref}$ and the driving frame $x_{dri}$. The image encoder encodes $x_{ref}$ into a latent representation $z$; $f$ and $m$ are then used to transform $z$ into $\tilde{z}$, which the image decoder decodes into an output image $\hat{x}_{out}$. The training objective is to minimize the disparity between $x_{dri}$ and $\hat{x}_{out}$.
Figure 5: Comparison of video quality between ground-truth videos and those generated by DabFusion. We use the same starting image and music piece to generate videos with our model, and select three frames from the same position.
...and 2 more figures
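The Figure 3 caption describes $h_{1}^{N}$ as $N$ frames of latent flow plus occlusion maps that transform $z_{0}$ into $\tilde{z}_{1}^{N}$. A minimal sketch of that step follows, assuming backward warping with nearest-neighbour sampling followed by multiplication with the occlusion map. The function names, the channel order (x displacement, y displacement, occlusion), and the sampling scheme are assumptions for illustration; the captions do not specify how the warp is implemented.

```python
import numpy as np

def warp_latent(z, flow, occlusion):
    """Backward-warp a latent map z (H, W, C) with flow (H, W, 2), then mask it.

    flow[..., 0] is taken as the x displacement and flow[..., 1] as the
    y displacement (assumed convention). occlusion has shape (H, W).
    """
    H, W, _ = z.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # For each target position, look up the source position the flow points to.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    warped = z[src_y, src_x]                 # nearest-neighbour lookup
    return warped * occlusion[..., None]     # suppress occluded positions

def apply_flow_sequence(z0, h):
    """Apply every frame of h (H, W, 3, N) to z0 and return (N, H, W, C)."""
    frames = [
        warp_latent(z0, h[:, :, :2, n], h[:, :, 2, n])
        for n in range(h.shape[3])
    ]
    return np.stack(frames)

# Toy example: a 4 x 4 latent map with 2 channels and N = 2 frames.
z0 = np.arange(4 * 4 * 2, dtype=np.float64).reshape(4, 4, 2)
h = np.zeros((4, 4, 3, 2))
h[:, :, 2, :] = 1.0      # nothing is occluded
h[:, :, 0, 1] = 1.0      # second frame samples one pixel to the right
z_seq = apply_flow_sequence(z0, h)
print(z_seq.shape)
```

Because the flow acts on the latent map rather than on pixels, the same generated flow sequence can in principle be applied to the latent map of any reference image, which is the property the abstract credits for animating unseen individuals.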
