GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Xinran Liu, Xu Dong, Shenbin Qian, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng

TL;DR

GCDance is a diffusion-based framework for genre-controlled 3D full-body dance generation conditioned on music and text. It derives a text-driven genre control signal from a genre classifier and prompt encoder, fuses Wav2CLIP music foundation features with hand-crafted cues, and employs a multi-task learning strategy (Nash MTL or Aligned MTL) to balance realism, spatial accuracy, and genre alignment. The method uses independent body and hand decoders with FiLM conditioning for genre adaptation, and diffusion-based sampling with inpainting to support editing and long-term coherence. Experiments on FineDance and AIST++ show superior motion quality, diversity, and controllability with favorable efficiency, and qualitative analyses show clear genre-driven stylistic differences. The work advances controllable, semantically aligned dance synthesis and highlights the value of cross-modal alignment and multi-objective optimization in diffusion-based motion generation.
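As a rough illustration of the FiLM conditioning mentioned above, the sketch below applies a feature-wise scale and shift, both predicted from a conditioning vector, to a block of motion features. It shows the general FiLM operation only, not the paper's decoder: the function name `film`, the linear projections, and all shapes are assumptions made for the example.

```python
import numpy as np

def film(features, genre_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: scale and shift each feature channel.

    features:        (frames, channels) array of motion features
    genre_embedding: (embed_dim,) conditioning vector (hypothetical genre code)
    w_*, b_*:        weights and biases of two linear maps (assumed for this sketch)
    """
    gamma = genre_embedding @ w_gamma + b_gamma  # per-channel scale, shape (channels,)
    beta = genre_embedding @ w_beta + b_beta     # per-channel shift, shape (channels,)
    return gamma * features + beta               # broadcast over the frame axis

rng = np.random.default_rng(0)
frames, channels, embed_dim = 4, 8, 16           # illustrative sizes only
features = rng.normal(size=(frames, channels))
genre_embedding = rng.normal(size=embed_dim)

w_gamma = 0.1 * rng.normal(size=(embed_dim, channels))
b_gamma = np.ones(channels)                      # scale starts near 1
w_beta = 0.1 * rng.normal(size=(embed_dim, channels))
b_beta = np.zeros(channels)                      # shift starts near 0

modulated = film(features, genre_embedding, w_gamma, b_gamma, w_beta, b_beta)

# With zero projection weights the layer reduces to the identity.
identity = film(features, genre_embedding,
                np.zeros((embed_dim, channels)), np.ones(channels),
                np.zeros((embed_dim, channels)), np.zeros(channels))
```

Because the scale and shift depend only on the conditioning vector, the same features can be steered toward different styles by swapping the embedding, which is why this kind of layer suits genre adaptation.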

Abstract

Music-driven dance generation is challenging because it requires strict adherence to genre-specific choreography while producing dance sequences that are physically realistic and precisely synchronized with the music's beats and rhythm. Although significant progress has been made in music-conditioned dance generation, most existing methods struggle to convey specific stylistic attributes in the generated dance. To bridge this gap, we propose a diffusion-based framework for genre-specific 3D full-body dance generation, conditioned on both music and descriptive text. To incorporate genre information, we develop a text-based control mechanism that maps input prompts, either explicit genre labels or free-form descriptive text, into genre-specific control signals, enabling precise and controllable text-guided generation of genre-consistent dance motions. Furthermore, to enhance the alignment between the music and text conditions, we leverage features from a music foundation model, facilitating coherent and semantically aligned dance synthesis. Finally, to balance the objectives of extracting text-genre information and maintaining high-quality generation results, we propose a novel multi-task optimization strategy. This strategy balances competing factors such as physical realism, spatial accuracy, and text classification, significantly improving the overall quality of the generated sequences. Extensive experiments on the FineDance and AIST++ datasets demonstrate the superiority of GCDance over existing state-of-the-art approaches.

Paper Structure

This paper contains 18 sections, 15 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Given an audio input and a genre-descriptive textual prompt, GCDance generates 3D dance sequences that align well with the musical melody and beat while adhering to the textual instruction.
  • Figure 2: An overview of GCDance. Left: the multimodal inputs and feature extraction. Middle: the training process at a given diffusion timestep $t$. Right: the sampling process, where a sequence of dance motions is generated iteratively.
  • Figure 3: The decoder of GCDance.
  • Figure 4: The control module of GCDance.
  • Figure 5: GCDance can generate joint-specific and temporally-specific dance segments. In the left example, the constrained body joints are shown in gray, while the generated hand joints are depicted in cyan. In the middle example, the constrained upper-body joints are shown in purple, and the generated leg joints are depicted in yellow. In the right example, the constrained first second is shown in red, while the generated last three seconds are depicted in green.
  • ...and 3 more figures
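
The joint-specific and temporally-specific editing described in the Figure 5 caption can be pictured as mask-based blending: constrained entries are copied from a reference motion and the rest come from the generator. The sketch below shows only that blending step. In diffusion inpainting generally, such a blend is applied at each denoising step with the reference noised to the current timestep; whether GCDance follows exactly this procedure is not stated here. The frame rate, feature count, and body/hand column split are assumptions made for the example.

```python
import numpy as np

def blend_with_constraints(generated, reference, mask):
    """Keep entries where mask == 1 from `reference`; take the rest from `generated`."""
    return mask * reference + (1.0 - mask) * generated

fps, seconds, num_features = 30, 4, 6      # assumed values, for illustration only
frames = fps * seconds
reference = np.ones((frames, num_features))   # stand-in for the constrained motion
generated = np.zeros((frames, num_features))  # stand-in for the model's output

# Temporal constraint: fix the first second, generate the remaining three.
temporal_mask = np.zeros((frames, num_features))
temporal_mask[:fps, :] = 1.0

# Joint-wise constraint: fix the first four feature columns (hypothetical "body")
# and generate the last two (hypothetical "hands").
joint_mask = np.zeros((frames, num_features))
joint_mask[:, :4] = 1.0

temporal_edit = blend_with_constraints(generated, reference, temporal_mask)
joint_edit = blend_with_constraints(generated, reference, joint_mask)
```

The same function covers both cases because the only thing that changes is which axis of the mask is set: rows for time, columns for joints.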