DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis
Xin Gao, Li Hu, Peng Zhang, Bang Zhang, Liefeng Bo
TL;DR
DanceMeld addresses the one-to-many mapping inherent in music-to-dance synthesis with a two-stage pipeline. A hierarchical VQ-VAE explicitly decouples dance poses (bottom code) from dance movements (top code), and a music-conditioned diffusion prior generates latent codes that are decoded into motion; a suite of auxiliary and modality alignment losses further improves realism and rhythm matching. The resulting latent representations are interpretable, which enables dance style transfer and editing, and the method achieves state-of-the-art results on the AIST++ dataset both qualitatively and quantitatively. By combining choreography-inspired structure with diffusion-based generation, the work offers controllable 3D character dance generation with potential for broader choreography-aware synthesis applications.
Abstract
In the realm of 3D digital human applications, music-to-dance synthesis is a challenging task. Given the one-to-many relationship between music and dance, previous methods have been limited in their approach, relying solely on matching and generating corresponding dance movements based on music rhythm. In the professional field of choreography, a dance phrase consists of several dance poses and dance movements. Dance poses are composed of a series of basic, meaningful body postures, while dance movements reflect dynamic changes such as the rhythm, melody, and style of the dance. Taking inspiration from these concepts, we introduce a dance generation pipeline called DanceMeld, which comprises two stages: the dance decouple stage and the dance generation stage. In the decouple stage, a hierarchical VQ-VAE disentangles dance poses and dance movements at different feature space levels, where the bottom code represents dance poses and the top code represents dance movements. In the generation stage, we use a diffusion model as a prior to model the distribution of latent codes and generate them conditioned on music features. Experiments demonstrate the representational capabilities of the top and bottom codes, which enable an explicit, decoupled representation of dance poses and dance movements. This disentanglement not only provides control over motion details, styles, and rhythm but also facilitates applications such as dance style transfer and dance unit editing. Qualitative and quantitative experiments on the AIST++ dataset demonstrate the superiority of our approach over other methods.
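
To make the two-level quantization idea concrete, the following is a minimal NumPy sketch of nearest-neighbor codebook lookup at two temporal resolutions. It is an illustration under stated assumptions, not the DanceMeld implementation: the function names `quantize` and `two_level_codes`, the `stride` parameter, the use of average pooling in place of a learned encoder, and the choice to give the top code the coarser temporal resolution are all hypothetical and are not specified in the abstract.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2).

    features: array of shape (T, D); codebook: array of shape (K, D).
    Returns the selected indices (T,) and the quantized vectors (T, D).
    """
    diffs = features[:, None, :] - codebook[None, :, :]   # (T, K, D)
    dists = (diffs ** 2).sum(axis=-1)                     # (T, K)
    indices = dists.argmin(axis=1)                        # (T,)
    return indices, codebook[indices]


def two_level_codes(motion_feats, bottom_codebook, top_codebook, stride=4):
    """Produce a coarse (top) and a fine (bottom) discrete code sequence.

    Assumption: average pooling over `stride` frames stands in for a
    learned top-level encoder; a real hierarchical VQ-VAE learns both
    encoders and codebooks jointly.
    """
    num_frames, dim = motion_feats.shape
    usable = (num_frames // stride) * stride              # drop the ragged tail
    frames = motion_feats[:usable]

    # Top level: one code per window of `stride` frames.
    pooled = frames.reshape(-1, stride, dim).mean(axis=1)
    top_idx, _ = quantize(pooled, top_codebook)

    # Bottom level: one code per frame.
    bottom_idx, _ = quantize(frames, bottom_codebook)
    return top_idx, bottom_idx
```

In this sketch, 16 frames with `stride=4` would yield 4 top indices and 16 bottom indices. In a two-stage pipeline of the kind described above, a prior would then be trained to generate such index sequences (or their embeddings) conditioned on music features.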
