Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior

Foram N Shah, Parshwa Shah, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy

TL;DR

DanceMosaic provides a high-fidelity, editable 3D dance synthesis framework by marrying a text-conditioned motion prior with two modality-specific towers for music and pose. Through synchronized progressive masked training and inference-time multimodal guidance, it integrates music, genre text, and pose constraints to produce realistic, rhythmically aligned dances while enabling precise editing. Quantitative results on FineDance show state-of-the-art motion fidelity, diversity, and beat synchronization, with real-time performance and versatile editing capabilities. The approach offers practical impact for choreography, animation, and interactive dance applications by delivering controllable, scalable, and editable dance generation.

Abstract

Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still face significant challenges in simultaneously achieving high realism, precise dance-music synchronization, diverse motion expression, and physical plausibility. To address these limitations, we propose a novel approach that leverages a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from diverse guidance signals, including music, genre, and pose, into high-quality dance motion sequences. Our framework also supports semantic motion editing, such as motion inpainting and body part modification. Specifically, we introduce a multi-tower masked motion model that integrates a text-conditioned masked motion backbone with two parallel, modality-specific branches: a music-guidance tower and a pose-guidance tower. The model is trained using synchronized and progressive masked training, which allows effective infusion of the pretrained text-to-motion prior into the dance synthesis process while enabling each guidance branch to optimize independently through its own loss function, mitigating gradient interference. During inference, we introduce classifier-free logits guidance and pose-guided token optimization to strengthen the influence of music, genre, and pose signals. Extensive experiments demonstrate that our method sets a new state of the art in dance generation, significantly advancing both the quality and editability over existing approaches. Project Page available at https://foram-s1.github.io/DanceMosaic/
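As a rough illustration of the classifier-free logits guidance mentioned in the abstract, the sketch below applies the generic classifier-free guidance rule to token logits: the conditional prediction is pushed away from the unconditional one by a guidance scale. This is the standard textbook form, not necessarily the exact formulation used in DanceMosaic; the function names `cfg_logits` and `softmax`, the `scale` parameter, and the toy logits are all assumptions made for illustration.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance on logits (illustrative, hypothetical helper).

    guided = uncond + scale * (cond - uncond)
    scale = 0 gives the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 amplifies the influence of the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a two-token motion codebook (made-up numbers).
cond_logits = [2.0, 0.0]    # prediction with a guidance signal, e.g. music
uncond_logits = [1.0, 0.0]  # prediction with the guidance signal dropped

guided = cfg_logits(cond_logits, uncond_logits, scale=3.0)
probs = softmax(guided)
```

With a scale above 1, the token favored by the condition receives a higher probability than under the conditional logits alone, which is the sense in which guidance strengthens the influence of a signal at sampling time.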

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: DanceMosaic generates 3D dance motions based on multiple guidance signals. The top sequence showcases generated dance motions influenced by different text prompts, including genre-based or action-specific prompts. The color-coded figures represent different dance styles, synchronized with a music signal at the bottom. The pose signal allows further motion refinement, demonstrating the flexibility and precision of DanceMosaic.
  • Figure 2: Overview of DanceMosaic's training phase. (a) The process involves encoding dance motions into discrete token sequences using a dance motion tokenizer. (b) These tokens are then processed through a multi-tower masked motion model, where each tower (music, text, and pose) is used to learn the probabilistic mappings from modality-specific guidance signals to motion tokens. The model is trained using a progressive training strategy to integrate music, text, and pose signals.
  • Figure 3: Overview of DanceMosaic's inference phase. (a) Motion Prior: The music, pose, and text conditions are encoded in parallel and passed through their respective towers, which generate guidance for each modality. (b) Pose-guided Inference: If pose conditions are provided, a final inference-time optimization step refines the generated poses so that they align closely with those conditions.
  • Figure 4: Various Applications Using DanceMosaic