MatchDance: Collaborative Mamba-Transformer Architecture Matching for High-Quality 3D Dance Synthesis

Kaixing Yang, Xulong Tang, Yuxuan Hu, Jiahao Yang, Hongyan Liu, Qinnan Zhang, Jun He, Zhaoxin Fan

TL;DR

MatchDance tackles the challenge of choreographic consistency in music-to-dance synthesis by decoupling dance quality from music-dance synchronization through a two-stage latent-space framework. The KDQS stage quantizes dancer motions into discrete codes with kinematic-dynamic constraints via Finite Scalar Quantization, while the HMDGS stage maps music to these latent codes using a Mamba-Transformer with Slide Window Attention and MuQ-based music representations. The approach is complemented by a retrieval-based evaluation framework and extensive experiments on FineDance that report state-of-the-art performance in both quantitative and qualitative metrics, as well as a real-time generation capability. Overall, MatchDance advances both the modeling and evaluation of music-to-dance systems, with practical implications for choreography, virtual reality, and AI-assisted creative content generation.

Abstract

Music-to-dance generation represents a challenging yet pivotal task at the intersection of choreography, virtual reality, and creative content generation. Despite its significance, existing methods face substantial limitations in achieving choreographic consistency. To address this challenge, we propose MatchDance, a novel framework for music-to-dance generation that constructs a latent representation to enhance choreographic consistency. MatchDance employs a two-stage design: (1) a Kinematic-Dynamic-based Quantization Stage (KDQS), which encodes dance motions into a latent representation by Finite Scalar Quantization (FSQ) with kinematic-dynamic constraints and reconstructs them with high fidelity, and (2) a Hybrid Music-to-Dance Generation Stage (HMDGS), which uses a Mamba-Transformer hybrid architecture to map music into the latent representation, followed by the KDQS decoder to generate 3D dance motions. Additionally, a music-dance retrieval framework and comprehensive metrics are introduced for evaluation. Extensive experiments on the FineDance dataset demonstrate state-of-the-art performance. Code will be released upon acceptance.
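To make the quantization step in KDQS more concrete, the following is a minimal sketch of Finite Scalar Quantization in its general form: each latent dimension is bounded with tanh and rounded to one of a fixed number of levels, and the per-dimension codes combine into a single discrete index. The function names `fsq_quantize` and `fsq_index` and the level counts `[7, 5, 5]` are illustrative assumptions, not the paper's configuration. The kinematic-dynamic constraints and the straight-through gradient used during training are omitted.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent value to one of `levels[i]` integer codes.

    Assumption: odd level counts only, so the codes are symmetric around 0.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half   # lies strictly inside (-half, half)
        codes.append(round(bounded))        # integer in [-half, half]
    return codes

def fsq_index(codes, levels):
    """Map per-dimension codes to one index in [0, prod(levels) - 1]."""
    index, basis = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * basis      # shift code into [0, num_levels - 1]
        basis *= num_levels
    return index

levels = [7, 5, 5]                          # hypothetical: 7 * 5 * 5 = 175 codes
codes = fsq_quantize([0.3, -1.2, 4.0], levels)
token = fsq_index(codes, levels)
```

Because the codebook is implicit in the rounding grid, there are no learned codebook vectors to maintain. A second stage such as HMDGS can then treat the resulting indices as the discrete targets it predicts from music.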

Paper Structure

This paper contains 37 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of MatchDance. In KDQS, MatchDance first trains multiple FSQs to reconstruct the lower and upper body parts. In HMDGS, a music-driven dance generation model is trained on the latent representations using a Mamba-Transformer hybrid architecture and a Slide Window Attention (SW-Attention) mechanism.
  • Figure 2: Architecture of the retrieval model.
  • Figure 3: Qualitative comparison. MatchDance demonstrates superior visual performance compared to existing methods. The generated dances achieve higher dance quality and better synchronization, and also exhibit greater diversity and complexity in movement patterns.
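
The Slide Window Attention (SW-Attention) mentioned in the Figure 1 caption restricts each position to attending over nearby positions only. The sketch below shows the general idea on a matrix of precomputed attention scores. The symmetric window, the `window` parameter, and the function names are assumptions for illustration; the page does not state whether the paper's window is symmetric or causal, or how wide it is.

```python
import math

def sliding_window_mask(seq_len, window):
    """True where position i may attend to position j (|i - j| <= window)."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def windowed_attention_weights(scores, window):
    """Row-wise softmax over `scores`, ignoring positions outside the window."""
    mask = sliding_window_mask(len(scores), window)
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0
                for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores over 4 frames with a window of 1 frame on each side.
uniform_scores = [[0.0] * 4 for _ in range(4)]
attn = windowed_attention_weights(uniform_scores, window=1)
```

With a fixed window, the number of allowed positions per row stays constant as the sequence grows, which is the usual motivation for local attention on long motion sequences.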