DGFM: Full Body Dance Generation Driven by Music Foundation Models
Xinran Liu, Zhenhua Feng, Diptesh Kanojia, Wenwu Wang
TL;DR
The paper tackles full-body music-driven dance generation with a diffusion model conditioned on both music and text. It proposes DGFM, which fuses high-level music features from music foundation models (notably Wav2CLIP) with hand-crafted features (STFT) and CLIP-derived genre prompts to guide a Transformer-based denoiser operating on SMPL representations. In experiments on the FineDance dataset, DGFM achieves better motion realism, beat synchronization, and diversity than baselines and state-of-the-art methods, with the strongest results when Wav2CLIP is combined with STFT. The work shows that pairing foundation-model audio representations with targeted hand-crafted features improves cross-modal dance generation, with potential for more expressive and genre-aware choreography.
Abstract
In music-driven dance motion generation, most existing methods rely on hand-crafted features and overlook the profound impact that music foundation models have had on cross-modal content generation. To bridge this gap, we propose a diffusion-based method that generates dance movements conditioned on text and music. Our approach extracts music features by combining high-level features obtained from a music foundation model with hand-crafted features, thereby enhancing the quality of the generated dance sequences. This combination leverages both high-level semantic information and low-level temporal details to improve the model's understanding of music features. To show the merits of the proposed method, we compare it with four music foundation models and two sets of hand-crafted music features. The results demonstrate that our method generates the most realistic dance sequences and achieves the best match with the input music.
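
The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the fusion operator, so per-frame concatenation is an assumption here, and the foundation-model embeddings are replaced by random placeholder values rather than real Wav2CLIP outputs. The function names `stft_magnitude` and `fuse_features`, the 16 kHz sample rate, the window and hop sizes, and the 512-dimensional embedding width are all hypothetical choices made for illustration.

```python
import numpy as np


def stft_magnitude(audio, n_fft=512, hop=256):
    """Hand-crafted feature: magnitude STFT, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    # Slice the signal into overlapping windowed frames.
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real-input FFT of each frame, keeping only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1))


def fuse_features(high_level, hand_crafted):
    """Assumed fusion: concatenate the two streams along the feature axis."""
    if high_level.shape[0] != hand_crafted.shape[0]:
        raise ValueError("feature streams must have the same number of frames")
    return np.concatenate([high_level, hand_crafted], axis=1)


rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # placeholder: one second of noise at 16 kHz

stft_feats = stft_magnitude(audio)

# Placeholder for per-frame embeddings from a music foundation model.
# A real pipeline would compute these from the audio; random values are
# used here only so that the sketch runs on its own.
embed_dim = 512
high_level = rng.standard_normal((stft_feats.shape[0], embed_dim))

fused = fuse_features(high_level, stft_feats)
print(stft_feats.shape, fused.shape)
```

In this sketch the two streams are assumed to be aligned to the same frame rate before fusion; in practice the embeddings and the spectrogram would likely need resampling to a common temporal resolution (for example, the motion frame rate) before they can condition the denoiser.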
