GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation
Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng
TL;DR
GCDance is a diffusion-based framework for genre-controlled 3D full-body dance generation conditioned on music and text. It derives a text-driven genre control signal from a genre classifier and a prompt encoder, combines music foundation model features (via Wav2CLIP) with hand-crafted cues, and uses a multi-task learning strategy (Nash MTL or Aligned MTL) to balance realism, spatial accuracy, and genre alignment. The method uses independent body and hand decoders with FiLM conditioning for genre adaptation, and diffusion-based sampling with inpainting to support editing and long-term coherence. Experiments on FineDance and AIST++ show improved motion quality, diversity, and controllability at favorable computational cost, and qualitative analyses show clear genre-driven stylistic differences. The work advances controllable, semantically aligned dance synthesis and highlights the value of cross-modal alignment and multi-objective optimization in diffusion-based motion generation.
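To make the FiLM conditioning mentioned above concrete, the following is a minimal sketch of feature-wise linear modulation, where each feature channel is scaled and shifted by parameters derived from a conditioning vector. It is not the paper's implementation: the function names `film` and `genre_to_film_params`, the linear projection, and the `1.0 +` offset on the scale are all assumptions made for illustration, and the actual model presumably learns these mappings inside its decoders.

```python
from typing import List, Sequence, Tuple


def genre_to_film_params(
    genre_embedding: Sequence[float],
    w_gamma: Sequence[Sequence[float]],
    w_beta: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Hypothetical linear map from a genre embedding to FiLM parameters.

    Each row of w_gamma / w_beta produces one channel's scale / shift.
    The scale is offset by 1.0 so zero weights leave features unchanged
    (an illustrative choice, not taken from the paper).
    """
    gamma = [1.0 + sum(w * e for w, e in zip(row, genre_embedding)) for row in w_gamma]
    beta = [sum(w * e for w, e in zip(row, genre_embedding)) for row in w_beta]
    return gamma, beta


def film(
    features: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Apply FiLM per frame: out[t][c] = gamma[c] * features[t][c] + beta[c]."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same number of channels")
    out = []
    for frame in features:
        if len(frame) != len(gamma):
            raise ValueError("frame width must match the number of FiLM channels")
        out.append([g * x + b for x, g, b in zip(frame, gamma, beta)])
    return out


# Two motion frames with two feature channels each (toy values).
frames = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = genre_to_film_params(
    genre_embedding=[1.0, 0.0],
    w_gamma=[[1.0, 0.0], [0.0, 1.0]],
    w_beta=[[0.5, 0.0], [0.0, 0.5]],
)
modulated = film(frames, gamma, beta)
```

In this toy setup, changing the genre embedding changes only `gamma` and `beta`, so the same motion features are restyled without altering the rest of the computation, which is the general appeal of FiLM for conditioning.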
Abstract
Music-driven dance generation is a challenging task: it requires strict adherence to genre-specific choreography while producing dance sequences that are physically realistic and precisely synchronized with the music's beats and rhythm. Although significant progress has been made in music-conditioned dance generation, most existing methods struggle to convey specific stylistic attributes in the generated dance. To bridge this gap, we propose a diffusion-based framework for genre-specific 3D full-body dance generation, conditioned on both music and descriptive text. To incorporate genre information effectively, we develop a text-based control mechanism that maps input prompts, either explicit genre labels or free-form descriptive text, into genre-specific control signals, enabling precise and controllable text-guided generation of genre-consistent dance motions. Furthermore, to enhance the alignment between the music and text conditions, we leverage the features of a music foundation model, facilitating coherent and semantically aligned dance synthesis. Finally, to balance the objectives of extracting text-genre information and maintaining high-quality generation results, we propose a novel multi-task optimization strategy. This strategy balances competing factors such as physical realism, spatial accuracy, and text classification, significantly improving the overall quality of the generated sequences. Extensive experimental results on the FineDance and AIST++ datasets demonstrate the superiority of GCDance over existing state-of-the-art approaches.
