TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography

Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, Jian Yang

TL;DR

This work tackles music-driven group choreography by introducing TCDiff++, an end-to-end diffusion framework that jointly models global group formation and local footwork. It integrates a Dancer Positioning Embedding, Fusion Projection, and a Sequence Decoder in the Group Dance Decoder, plus a Footwork Adaptor to refine lower-body motion, and it employs Long Group Diffusion Sampling to maintain coherence over long sequences. The model is trained with a composite loss that enforces temporal consistency, foot-ground contact, forward-kinematics fidelity, and global inter-dancer spacing via a distance-consistency term. Empirical results show state-of-the-art performance, especially for long-duration dances, with clear improvements in avoiding dancer collisions and reducing foot sliding, enabling more realistic and cohesive group choreography from music inputs.

Abstract

Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. However, most existing group dance generation methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding, and abrupt position swapping when generating long group dances. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to encode temporal and identity information. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model's ability to selectively process long sequences. Extensive experiments demonstrate that TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, producing high-quality and coherent group dances.
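
To make the distance-consistency idea concrete, the sketch below penalizes pairwise dancer distances that leave a plausible range. The hinge form, the bounds `d_min` and `d_max`, and the function name are assumptions made for illustration; the abstract does not state the exact formulation used in TCDiff++.

```python
import numpy as np

def distance_consistency_penalty(positions, d_min=0.5, d_max=5.0):
    """Hypothetical hinge penalty on inter-dancer distances.

    positions: array of shape (T, N, 2), ground-plane root positions
               of N dancers over T frames.
    d_min, d_max: assumed bounds of the plausible distance range.
    """
    # Pairwise differences between all dancers in every frame: (T, N, N, 2).
    diff = positions[:, :, None, :] - positions[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (T, N, N)

    # Keep each unordered pair once (upper triangle, no diagonal): (T, P).
    rows, cols = np.triu_indices(positions.shape[1], k=1)
    pair_dist = dist[:, rows, cols]

    # Penalize only distances outside [d_min, d_max].
    too_close = np.maximum(d_min - pair_dist, 0.0)
    too_far = np.maximum(pair_dist - d_max, 0.0)
    return float(np.mean(too_close ** 2 + too_far ** 2))
```

In this form the penalty is zero whenever every pair of dancers stays inside the assumed range, so it only acts on frames where dancers collide or drift apart.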

Paper Structure

This paper contains 23 sections, 20 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Visualizations of three key issues in baseline models: multi-dancer collisions (TCDiff), single-dancer foot sliding (CoDancers), and abrupt position swaps in long group dance generation (GCD), where the blue and purple dancers suddenly swap positions. In contrast, our approach eliminates these issues, delivering superior visual aesthetics.
  • Figure 2: Our end-to-end TCDiff++ framework comprises two key components: the Group Dance Decoder (GDD) and the Footwork Adaptor (FA). Given the music, the GDD first generates a raw motion sequence $\hat{\boldsymbol{x}}^r$ without trajectory overlap. The FA then refines the foot movements using the positional information of the raw motion, producing an adapted motion $\hat{\boldsymbol{x}}_0^a$ with improved footstep actions that reduce foot sliding. Finally, the adapted footstep movements are incorporated into the raw motion, yielding a harmonious dance sequence $\hat{\boldsymbol{x}}_0$ with stable footwork and fewer dancer collisions. Compared to the previous two-stage version, TCDiff++ requires only a single training stage and achieves better coherence between footwork and body motion.
  • Figure 3: Our Fusion Projection (FP) module addresses the issue of dancer ambiguity. Imbalanced feature representations can cause positions to be misinterpreted as similar, leading to identical predictions. The FP module increases input dimensionality to enhance dancer differentiation, preserving positional differences and reducing collisions.
  • Figure 4: Our Long Group Diffusion Sampling (LGDS) method first generates partially overlapping segments, which are then merged to form a complete sequence. Unlike naive sampling, LGDS enforces consistency during the input phase rather than the sampling phase. This approach reduces randomness and ensures cleaner positional information during generation, thereby reducing abrupt swaps.
  • Figure 5: Visual comparison with baselines. Baselines often produce collisions (highlighted in the red box) during position exchanges.
  • ...and 5 more figures
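
The overlap-and-merge scheme described in the Figure 4 caption can be sketched as follows. This is a simplified reading, not the authors' algorithm: `denoise` is a hypothetical stand-in for the full diffusion sampler, the function name and parameters are invented for illustration, and the injection is reduced to copying the previous segment's tail into the start of the next noisy input.

```python
import numpy as np

def sample_long(denoise, total_len, seg_len, overlap, dim, seed=0):
    """Generate a long sequence from overlapping segments (illustrative).

    denoise: hypothetical callable mapping a noisy (seg_len, dim) array
             to a clean (seg_len, dim) array.
    """
    assert 0 < overlap < seg_len
    stride = seg_len - overlap
    assert (total_len - seg_len) % stride == 0, "segments must tile the sequence"

    rng = np.random.default_rng(seed)
    out = np.zeros((total_len, dim))
    segments = []
    prev_tail = None
    for start in range(0, total_len - seg_len + 1, stride):
        noisy = rng.standard_normal((seg_len, dim))
        if prev_tail is not None:
            # Enforce consistency at the input: the overlapping frames start
            # from the previous segment's result instead of pure noise.
            noisy[:overlap] = prev_tail
        seg = denoise(noisy)
        # Merge: the newer segment overwrites the overlapping frames.
        out[start:start + seg_len] = seg
        segments.append(seg)
        prev_tail = seg[-overlap:].copy()
    return out, segments
```

For example, `sample_long(lambda x: x.copy(), total_len=10, seg_len=4, overlap=2, dim=3)` produces four segments whose overlapping frames agree. A real sampler would run many denoising steps per segment and might blend the overlap rather than overwrite it.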