Lodge++: High-quality and Long Dance Generation with Vivid Choreography Patterns

Ronghui Li, Hongwen Zhang, Yachao Zhang, Yuxiang Zhang, Youliang Zhang, Jie Guo, Yan Zhang, Xiu Li, Yebin Liu

TL;DR

Lodge++ generates ultra-long, high-quality, music-driven 3D dances by decoupling global choreography from local motion. A Global Choreography Network built on a VQ-VAE and a GPT sequence model learns global choreography patterns and derives dance primitives; a Primitive-based Dance Diffusion Model then denoises in parallel, guided by those primitives, to produce long sequences. A Foot Refine Block, a Multi-Genre Discriminator, and SDF-based Penetration Guidance improve physical realism and genre consistency. On the FineDance dataset, the method reports better beat alignment and less self-penetration, and ablation studies and user surveys support both the coupling of global primitives with diffusion and the physically informed refinements. Overall, Lodge++ delivers coherent choreography and detailed movement with improved computational efficiency.

Abstract

We propose Lodge++, a choreography framework to generate high-quality, ultra-long, and vivid dances given the music and desired genre. To handle the challenges in computational efficiency, the learning of complex and vivid global choreography patterns, and the physical quality of local dance movements, Lodge++ adopts a two-stage strategy to produce dances from coarse to fine. In the first stage, a global choreography network is designed to generate coarse-grained dance primitives that capture complex global choreography patterns. In the second stage, guided by these dance primitives, a primitive-based dance diffusion model is proposed to further generate high-quality, long-sequence dances in parallel, faithfully adhering to the complex choreography patterns. Additionally, to improve the physical plausibility, Lodge++ employs a penetration guidance module to resolve character self-penetration, a foot refinement module to optimize foot-ground contact, and a multi-genre discriminator to maintain genre consistency throughout the dance. Lodge++ is validated by extensive experiments, which show that our method can rapidly generate ultra-long dances suitable for various dance genres, ensuring well-organized global choreography patterns and high-quality local motion.

Paper Structure

This paper contains 28 sections, 23 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An overview of Lodge++. "PDDM" is the proposed Primitive-based Dance Diffusion Model. Given extremely long music and a desired dance genre as input, Lodge++ uses the Global Choreography Network to generate dance primitives that contain global choreography patterns. It then uses the parallel PDDM network to generate a high-quality, coherent, long-sequence choreographed dance.
  • Figure 2: The architecture of Lodge++. First, the Global Choreography Network produces coarse-grained dance motions. Expressive key motions near the dance beats of these coarse-grained motions are then detected and aligned with their corresponding music beats, forming $\bm{d}_s$. These key motions transfer the choreography patterns learned by the Global Choreography Network and further enhance the expressiveness and beat alignment of the dances generated by PDDM. The 8 frames of motion near $\left\{in\right\}_{i=1}^l$ are extracted as $\bm{d}_h$ and constrain the first and last 4 frames of each PDDM-generated local dance, which supports parallel generation in PDDM. Together, $\bm{d}_s$ and $\bm{d}_h$ make up the dance primitives. Next, noise is merged with the dance primitives to obtain $\Tilde{\bm{d}}^i_T$. The $\Tilde{\bm{d}}^i_T$ and the split music features $\bm{m}^i$ are then fed into the Primitive-based Dance Diffusion Model in parallel. After $T$ denoising steps, the final generated dance is obtained.
  • Figure 3: Structure of the Global Choreography Network. The network consists of two parts: a VQ-VAE and a sequence model. First, the VQ-VAE encodes the dance movements into a choreography memory codebook. Then, based on the input music and dance style, the sequence model generates a coarse-grained dance that adheres to the choreography patterns.
  • Figure 4: The forward and reverse processes of the Primitive-based Dance Diffusion Model. $\mathcal{T}$ denotes the diffusion time steps.
  • Figure 5: The detailed network architecture of the Denoise Network in the Primitive-based Dance Diffusion Model.
  • ...and 2 more figures
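
The Figure 2 caption describes how 8-frame windows taken from the coarse dance constrain the first and last 4 frames of each local segment, which is what lets the segments be generated in parallel. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. It assumes the index set in the caption means frame positions i·n for a fixed segment length n, and that each window is split evenly around its boundary; the function names, the `half` parameter, and the list-of-floats pose format are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Frame = Sequence[float]  # one pose vector; the real feature layout is not given here


def extract_boundary_windows(
    coarse: Sequence[Frame], n: int, half: int = 4
) -> Dict[int, List[Frame]]:
    """Map each interior boundary index i*n to the 2*half frames around it.

    Assumption: the window is centred on the boundary, i.e. `half` frames
    before it and `half` frames after it (8 frames in total for half=4).
    """
    if n < 2 * half:
        raise ValueError("segment length must be at least 2 * half")
    windows: Dict[int, List[Frame]] = {}
    # Only keep boundaries whose full window fits inside the sequence.
    for b in range(n, len(coarse) - half + 1, n):
        windows[b] = list(coarse[b - half : b + half])
    return windows


def segment_constraints(
    coarse: Sequence[Frame], n: int, half: int = 4
) -> List[Dict[str, Optional[List[Frame]]]]:
    """For every length-n segment, return the frames fixed at its head and tail.

    The tail of segment i and the head of segment i+1 come from the same
    contiguous window, so independently generated segments meet on frames
    that were already consecutive in the coarse dance.
    """
    windows = extract_boundary_windows(coarse, n, half)
    constraints = []
    for i in range(len(coarse) // n):
        start, end = i * n, (i + 1) * n
        head = windows[start][half:] if start in windows else None
        tail = windows[end][:half] if end in windows else None
        constraints.append({"head": head, "tail": tail})
    return constraints


# Toy usage: 32 one-dimensional "poses" split into two 16-frame segments.
toy_coarse = [[float(t)] for t in range(32)]
toy_constraints = segment_constraints(toy_coarse, n=16)
```

In this toy run the first segment has no head constraint and the last has no tail constraint, because the sequence ends there. How the actual model treats those end points, and how the windows are merged with noise before denoising, is not stated in the captions above.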