OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

Jinlu Zhang, Zixi Kang, Libin Liu, Jianlong Chang, Qi Tian, Feng Gao, Yizhou Wang

TL;DR

This work tackles the scarcity of richly annotated multimodal data for 3D dance generation and the need for flexible, controllable generation. It introduces OpenDanceSet, a large-scale dataset with 100.26 hours across 14 genres and five synchronized modalities, and OpenDanceNet, a unified masked modeling framework that uses a Disentangled Dance Tokenizer and a Multimodal-Condition Transformer to fuse music, text, keypoints, and trajectories. Through extensive experiments on AIST++ and OpenDanceSet, the approach achieves high fidelity, diverse motions, strong beat alignment, and improved controllability over prior methods. Limitations include limited hand/facial detail and simplified text tokens, with future work aiming to enrich modalities and tokenizer capability for richer editing and semantics.

Abstract

Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile, multimodal control. Because dance is highly dynamic and complex human motion spanning many styles and genres, dance generation must satisfy diverse conditions beyond music alone (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale, richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, consisting of a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions. Project Page: https://open-dance.github.io

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: We present OpenDanceSet, a large-scale multimodal human dance dataset, and develop OpenDanceNet, a masked modeling framework for controllable and flexible dance generation conditioned on any "Music+X" setting (X: 2D keypoints, trajectory, or text).
  • Figure 2: (Left) Overview of OpenDanceSet. We design an annotation pipeline to build the large-scale multimodal dance dataset, including 100.26 hours across 14 dance genres, with diverse annotations. (Right) The genre and detailed genre distribution in OpenDanceSet.
  • Figure 3: Comparison of the processed OpenDanceSet, OpenDanceSet before filtering, and AIST++. After post-optimization and filtering, OpenDanceSet fits the dance data distribution and achieves better physical performance.
  • Figure 4: Overview of OpenDanceNet, a masked-modeling-based dance generation framework. (a) We first train a Disentangled Dance Tokenizer (DDT) to quantize spatial signals (joint rotations, global trajectories, and 2D keypoints) into discrete tokens. (b) Then the Multimodal-Condition Transformer (MCT) is trained by randomly sampling subsets of control modalities and applying token-level masks over trajectories, 2D keypoints, and motion tokens, enabling the model to handle diverse condition combinations while generating coherent dance motions. (c) At inference time, OpenDanceNet supports arbitrary configurations of input conditions for flexible multimodal control, while Multi-Step Logit-Ranked Re-Masking (MS-LRM) and footstep optimization progressively refine the generated motions and improve their physical plausibility.
  • Figure 5: Visualization of different spatial control signals. The top two rows use GT keypoints (last frame only) and the GT trajectory as conditions; the bottom two rows use random geometric trajectories. We recommend viewing the videos in the supplementary material.
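
The training scheme described in the Figure 4 caption (randomly sampling subsets of control modalities, then masking tokens over trajectories, 2D keypoints, and motion) can be illustrated with a minimal sketch. This is not the authors' implementation: the `MASK` id, the 0.5 keep probability, the mask-ratio range, the choice to fully mask a dropped control stream without supervising it, and all function names are assumptions made for illustration only.

```python
import random

MASK = -1  # hypothetical id standing in for a masked token
CONTROLS = ("text", "keypoints", "trajectory")  # optional conditions; music is always kept


def sample_conditions(rng):
    """Pick a random (possibly empty) subset of the optional controls, plus music."""
    return {"music"} | {c for c in CONTROLS if rng.random() < 0.5}  # assumed 0.5 keep rate


def mask_tokens(tokens, ratio, rng):
    """Replace a random fraction of tokens with MASK; return the masked copy and positions."""
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


def make_training_example(streams, rng):
    """streams maps 'motion', 'keypoints', 'trajectory' to lists of discrete token ids."""
    active = sample_conditions(rng)
    ratio = rng.uniform(0.1, 1.0)  # assumed mask-ratio range
    inputs, targets = {}, {}
    for name, tokens in streams.items():
        if name != "motion" and name not in active:
            # Assumption: a dropped control is hidden entirely and not supervised.
            inputs[name] = [MASK] * len(tokens)
            targets[name] = []
        else:
            inputs[name], targets[name] = mask_tokens(tokens, ratio, rng)
    return active, inputs, targets


rng = random.Random(0)
streams = {name: list(range(8)) for name in ("motion", "keypoints", "trajectory")}
active, inputs, targets = make_training_example(streams, rng)
```

Here `targets` holds the positions a model would be asked to reconstruct in each stream. Exposing the model to every subset of controls during training is what would let a single network accept any "Music+X" combination at inference time.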