DanceMeld: Unraveling Dance Phrases with Hierarchical Latent Codes for Music-to-Dance Synthesis

Xin Gao, Li Hu, Peng Zhang, Bang Zhang, Liefeng Bo

TL;DR

DanceMeld addresses the one-to-many mapping inherent in music-to-dance synthesis with a two-stage pipeline. A hierarchical VQ-VAE explicitly decouples dance poses (bottom code) from dance movements (top code), and a music-conditioned diffusion prior generates latent codes that are then decoded into motion; auxiliary and modality alignment losses further improve realism and rhythm matching. The resulting latent representations are interpretable, which enables dance style transfer and editing, and the method achieves state-of-the-art results on the AIST++ dataset both qualitatively and quantitatively. By combining choreography-inspired structure with diffusion-based generation, the work offers controllable 3D character dance generation and potential for broader choreography-aware synthesis applications.

Abstract

In the realm of 3D digital human applications, music-to-dance presents a challenging task. Given the one-to-many relationship between music and dance, previous methods have been limited in their approach, relying solely on matching and generating corresponding dance movements based on music rhythm. In the professional field of choreography, a dance phrase consists of several dance poses and dance movements. Dance poses are composed of a series of basic meaningful body postures, while dance movements reflect dynamic changes such as the rhythm, melody, and style of the dance. Taking inspiration from these concepts, we introduce an innovative dance generation pipeline called DanceMeld, which comprises two stages: the dance decouple stage and the dance generation stage. In the decouple stage, a hierarchical VQ-VAE is used to disentangle dance poses and dance movements at different feature space levels, where the bottom code represents dance poses and the top code represents dance movements. In the generation stage, we utilize a diffusion model as a prior to model the distribution of the latent codes and generate them conditioned on music features. We have experimentally demonstrated the representational capabilities of the top code and the bottom code, enabling the explicit, decoupled expression of dance poses and dance movements. This disentanglement not only provides control over motion details, styles, and rhythm but also facilitates applications such as dance style transfer and dance unit editing. Our approach has been evaluated in qualitative and quantitative experiments on the AIST++ dataset, demonstrating its superiority over other methods.
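
The two-level quantization idea in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the average-pooling "encoder", the feature dimension, the codebook sizes, and the downsampling factors below are all assumptions chosen for illustration, and a real hierarchical VQ-VAE would use learned encoders, decoders, and codebooks. The sketch only shows the core operation, nearest-neighbor lookup into two codebooks at two temporal resolutions, with the top level assumed to be the coarser one.

```python
import random


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        # Squared Euclidean distance from this vector to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]


def temporal_pool(features, factor):
    """Average non-overlapping windows of `factor` frames (stand-in for a learned encoder)."""
    pooled = []
    for start in range(0, len(features) - factor + 1, factor):
        window = features[start:start + factor]
        pooled.append([sum(col) / factor for col in zip(*window)])
    return pooled


def random_vectors(n, dim):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)

# Hypothetical sizes: 64 frames of 8-dimensional motion features.
motion = random_vectors(64, 8)
bottom_codebook = random_vectors(32, 8)  # assumed size; would hold pose-level codes
top_codebook = random_vectors(16, 8)     # assumed size; would hold movement-level codes

# Bottom level: finer temporal resolution. Top level: coarser resolution (assumption).
bottom_feats = temporal_pool(motion, 4)
top_feats = temporal_pool(bottom_feats, 2)

bottom_idx, bottom_q = quantize(bottom_feats, bottom_codebook)
top_idx, top_q = quantize(top_feats, top_codebook)

print(len(bottom_idx), len(top_idx))  # 16 8
```

In this simplified version the two levels are quantized independently; in the paper's pipeline the codes come from a trained hierarchical VQ-VAE, and a diffusion prior conditioned on music features generates them at inference time.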

Paper Structure (23 sections, 13 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: In the field of choreography, dance poses are composed of a series of basic meaningful body postures, while dance movements reflect the trends, rhythm, and energy of the motion. Our method uses a hierarchical VQ-VAE to decouple dance poses and dance movements by representing them with a bottom code and a top code (short line for bottom code and long line for top code).
  • Figure 2: Our method comprises two stages: the dance decouple stage and the dance generation stage. In the dance decouple stage, a hierarchical VQ-VAE is trained to decouple dance pose and dance movement by representing them with bottom code $\bm{e}_b$ and top code $\bm{e}_t$. In the dance generation stage, a diffusion model is employed as a prior to model the distribution of latent codes. The latent codes are then decoded into dance sequences by a motion decoder.
  • Figure 3: Different prior model architectures. (a) Separately model $p(\bm{h}_t | \bm{h}_m)$ and $p(\bm{h}_b^{'} | \bm{h}_t, \bm{h}_m)$. (b) Predict the joint probability $p(\bm{h}_b^{'}, \bm{h}_t | \bm{h}_m)$ by concatenating $\bm{h}_t$ and $\bm{h}_b^{'}$.
  • Figure 4: With the modality alignment loss, the beats of the dance motions and the music are better synchronized.
  • Figure 5: We present dance poses for 12 distinct bottom codes, combining each with a random selection of several top codes for visualization. The results reveal that the poses are constrained within a fixed range.
  • ...and 3 more figures
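
The two prior designs in the Figure 3 caption are related by the chain rule of probability, which holds for any joint distribution and is stated here only as a reading aid, using the caption's symbols ($\bm{h}_m$ is presumably the music feature, while $\bm{h}_t$ and $\bm{h}_b^{'}$ are the top and bottom latents):

$$
p(\bm{h}_b^{'}, \bm{h}_t \mid \bm{h}_m) = p(\bm{h}_t \mid \bm{h}_m)\, p(\bm{h}_b^{'} \mid \bm{h}_t, \bm{h}_m)
$$

Variant (a) models the two factors on the right-hand side separately, while variant (b) models the left-hand side directly over the concatenated latents.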