Controllable Dance Generation with Style-Guided Motion Diffusion

Hongsong Wang; Ying Zhu; Xin Geng; Liang Wang

TL;DR

This work tackles controllable dance generation by introducing Style-Guided Motion Diffusion (SGMD), a transformer-based diffusion model that conditions on music and user-defined style prompts. A lightweight Style Modulation module injects stylistic cues without altering content, while Spatial-Temporal Masking enables precise editing under temporal and spatial constraints. The authors propose three style-prompt encodings and establish new benchmarks for trajectory-based generation, in-betweening, and inpainting, demonstrating state-of-the-art performance on long-term generation and controllable editing on the AIST++ dataset. They also show that style description prompts and expressive audio representations (notably Jukebox) improve alignment and realism, enabling practical interactive and creative dance generation applications.

Abstract

Dance plays an important role as an art form and mode of expression in human culture, yet automatically generating dance sequences remains a significant challenge. Existing approaches often neglect controllability in dance generation. They also inadequately model the nuanced impact of music styles, resulting in dances that are poorly aligned with the expressive characteristics of the conditioning music. To address these gaps, we propose Style-Guided Motion Diffusion (SGMD), which integrates a Transformer-based architecture with a Style Modulation module. By combining music features with user-provided style prompts, SGMD ensures that the generated dances not only match the musical content but also reflect the desired stylistic characteristics. To enable flexible control over the generated dances, we introduce a spatial-temporal masking mechanism. As controllable dance generation has not been fully studied, we construct corresponding experimental setups and benchmarks for tasks such as trajectory-based dance generation, dance in-betweening, and dance inpainting. Extensive experiments demonstrate that our approach can generate realistic and stylistically consistent dances, while also empowering users to create dances tailored to diverse artistic and practical needs. Code is available on GitHub: https://github.com/mucunzhuzhu/DGSDP

Paper Structure (14 sections, 8 equations, 10 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Style-Guided Motion Diffusion
Spatial-Temporal Masking
Prompts of Dance Style
Experiments
Dataset and Evaluation Metrics
Experimental Setup
Evaluation of Dance Generation
Ablation Study and Analysis
Limitations and Applications
Conclusion

Figures (10)

Figure 1: The illustration of controllable dance generation with style prompts and user-provided constraints. We use a diffusion-based model as an example, with controlled constraints serving as guidance for the diffusion process.
Figure 2: The proposed Style-Guided Motion Diffusion (SGMD). The model is fed with a noisy motion sequence $X_T$ of length $N$ in a noising step $T$, along with conditioning music $c$, style prompts $s$ and $T$ itself. The music embedding serves as input for the cross-attention module. During the inference process, SGMD samples noise $X_T$ given the conditions $c$ and $s$ to predict the clean sample $\hat{X}$. Subsequently, it diffuses this sample back to $X_{T-1}$, and repeats this iterative process until $t=0$ is reached.
Figure 3: Controllable dance generation with spatial-temporal masking. For the known sequence, we add noise to it directly to obtain the noisy sequence at timestep $t-1$. For unknown sequences, we first use the trained network ${\hat{x}}_{\theta}$ to predict the motion at timestep 0, and then add noise to that prediction to obtain the noisy sequence at timestep $t-1$. The mask is two-dimensional and allows for control in both the temporal and spatial dimensions.
Figure 4: User study interface of dance generation and the corresponding results.
Figure 5: Visualization of generated dances for the same piece of music. We list four music genres, each associated with five different dance movements. Generated dance movements are in blue and real movements are in grey.
...and 5 more figures
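
The sampling loop in the Figure 2 caption and the masked update in the Figure 3 caption can be sketched together in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`make_mask`, `q_sample`, `masked_step`, `sample`), the linear noise schedule, and the `denoiser` callable standing in for ${\hat{x}}_{\theta}$ are all assumptions, and the music condition $c$ and style prompt $s$ are assumed to be bound inside `denoiser`.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal rates alpha_bar[0..T]; alpha_bar[0] = 1 means no noise at t = 0.
    The linear beta schedule is an assumption made for illustration."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def make_mask(n_frames, n_feats, known_frames, known_feats=None):
    """2D mask over (time, feature): 1 where the user supplies motion, 0 where it is generated."""
    mask = np.zeros((n_frames, n_feats))
    feats = slice(None) if known_feats is None else list(known_feats)
    for f in known_frames:
        mask[f, feats] = 1.0
    return mask

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: noise a clean sequence x0 to timestep t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def masked_step(x_t, t, known, mask, denoiser, alpha_bar, rng):
    """One reverse step from t to t-1 with spatial-temporal masking."""
    x0_hat = denoiser(x_t, t)                           # predict the clean motion (timestep 0)
    x_known = q_sample(known, t - 1, alpha_bar, rng)    # known part: noise the constraint directly
    x_unknown = q_sample(x0_hat, t - 1, alpha_bar, rng) # unknown part: noise the prediction
    return mask * x_known + (1.0 - mask) * x_unknown

def sample(known, mask, denoiser, T=50, seed=0):
    """Start from Gaussian noise X_T and iterate masked steps until t = 0."""
    rng = np.random.default_rng(seed)
    alpha_bar = make_schedule(T)
    x = rng.standard_normal(known.shape)
    for t in range(T, 0, -1):
        x = masked_step(x, t, known, mask, denoiser, alpha_bar, rng)
    return x
```

Under these assumptions, in-betweening corresponds to a mask that marks the first and last frames as known over all features, while a trajectory constraint would mark every frame as known over only a few features (for example, hypothetical root-position channels). Because the last step noises to timestep 0, where no noise is added, the known entries of the output equal the user-provided values exactly.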
