Dance Any Beat: Blending Beats with Visuals in Dance Video Generation
Xuanchen Wang, Heng Wang, Dongnan Liu, Weidong Cai
TL;DR
DabFusion tackles music-guided dance video generation for arbitrary individuals from a single reference image, removing the need for keypoint annotations. It introduces a two-stage framework: a latent-flow auto-encoder that models motion in latent space, and a diffusion-based latent-flow generator conditioned on a music representation $e=[e_c,e_w,e_b]$, where $e_c$ comes from a fine-tuned CLAP, $e_w$ from a fine-tuned Wav2CLIP, and $e_b$ carries beat cues extracted with Librosa. Trained on the AIST++ dataset, the method achieves high video quality and strong audio-video synchronization. The paper also introduces the 2D motion-music alignment score, which better captures rhythmic alignment in 2D. DabFusion further demonstrates Choreograph Anyone, which makes unseen individuals dance from fused images, and an ablation study shows that beat information and sequence length both materially affect the alignment metrics. Overall, the approach establishes a solid baseline for personalized, music-driven dance video synthesis, with potential applications in choreography, AR/VR, and entertainment.
Abstract
Generating dance from music is crucial for advancing automated choreography. Current methods typically produce skeleton keypoint sequences instead of dance videos and cannot make specific individuals dance, which reduces their real-world applicability. These methods also require precise keypoint annotations, which complicates data collection and limits the use of self-collected video datasets. To overcome these challenges, we introduce a novel task: generating dance videos directly from images of individuals, guided by music. This task enables dance generation for specific individuals without requiring keypoint annotations, making it more versatile and applicable to various situations. Our solution, the Dance Any Beat Diffusion model (DabFusion), uses a reference image and a music piece to generate dance videos featuring various dance types and choreographies. The music is analyzed by our specially designed music encoder, which identifies essential features including dance style, movement, and rhythm. DabFusion generates dance videos not only for individuals in the training dataset but also for any previously unseen person. This versatility stems from its approach of generating latent optical flow, which contains all the motion information necessary to animate any person in the image. We evaluate DabFusion's performance on the AIST++ dataset, focusing on video quality, audio-video synchronization, and motion-music alignment. We propose a 2D Motion-Music Alignment Score (2D-MM Align), which builds on the Beat Alignment Score to more effectively evaluate motion-music alignment for this new task. Experiments show that DabFusion establishes a solid baseline for this new task. Video results can be found on our project page: https://DabFusion.github.io.
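For readers unfamiliar with the Beat Alignment Score that 2D-MM Align builds on, the following is a minimal sketch of one common formulation: for each motion beat, take the distance to the nearest music beat, pass it through a Gaussian kernel, and average. This is not the paper's 2D-MM Align definition, which is not given in this section. The function name, the beat positions, and the value of `sigma` are illustrative assumptions.

```python
import math


def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """Mean Gaussian-weighted distance from each motion beat to its nearest music beat.

    Beats are positions on a shared time axis (frames here). `sigma` sets how
    quickly the score decays with distance; its default is an assumed value.
    """
    if not motion_beats or not music_beats:
        return 0.0
    total = 0.0
    for t_motion in motion_beats:
        # Distance to the closest music beat.
        nearest = min(abs(t_motion - t_music) for t_music in music_beats)
        total += math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
    return total / len(motion_beats)


# Hypothetical beat positions in frames, for illustration only.
music_beats = [0, 30, 60, 90]
aligned = beat_alignment_score([0, 30, 60, 90], music_beats)
shifted = beat_alignment_score([15, 45, 75], music_beats)
```

Under this formulation the score lies in (0, 1]: motion beats that coincide with music beats give a score of 1, and motion beats that fall midway between music beats push it toward 0.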
