EDGE: Editable Dance Generation From Music

Jonathan Tseng, Rodrigo Castellon, C. Karen Liu

TL;DR

EDGE is a diffusion-based framework for editable dance generation conditioned on music, supporting long-form synthesis and flexible editing. It combines a transformer decoder with cross-attention and Jukebox-derived music features to produce physically plausible choreographies, and it enables constraint-based in-betweening and joint-wise edits. The authors introduce a physically inspired metric (PFC) and a Contact Consistency Loss to improve foot-ground contact realism, and show through large-scale user studies that EDGE outperforms prior baselines on multiple criteria while maintaining beat alignment and diversity. The work also demonstrates strong generalization to in-the-wild music and provides a memory-efficient Jukebox feature extraction pipeline, making editing practical for animation and interactive applications.

Abstract

Dance is an important human art form, but creating new dances can be difficult and time-consuming. In this work, we introduce Editable Dance GEneration (EDGE), a state-of-the-art method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to the input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning and in-betweening. We introduce a new metric for physical plausibility, and extensively evaluate the quality of dances generated by our method through (1) multiple quantitative metrics on physical plausibility, beat alignment, and diversity benchmarks, and more importantly, (2) a large-scale user study, demonstrating a significant improvement over previous state-of-the-art methods. Qualitative samples from our model can be found at our website.

Paper Structure (46 sections, 11 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: EDGE generates diverse, physically plausible dance choreographies conditioned on music.
  • Figure 2: EDGE Pipeline Overview: EDGE learns to denoise dance sequences from time $t=T$ to $t=0$, conditioned on music. Music embedding information is provided by a frozen Jukebox model (Dhariwal et al., 2020) and acts as cross-attention context. EDGE takes a noisy sequence $\bm{z}_T \sim \mathcal{N}(0, \bm{I})$ and produces the estimated final sequence $\bm{\hat{x}}$, noising it back to $\bm{\hat{z}}_{T-1}$ and repeating until $t=0$.
  • Figure 3: Although EDGE is trained on 5-second clips, it can generate choreographies of any length by imposing temporal constraints on batches of sequences. In this example, EDGE constrains the first 2.5 seconds of each sequence to match the last 2.5 seconds of the previous one to generate a 12.5-second clip, as represented by the temporal regions of distinct clips in the batch that share the same color.
  • Figure 4: EDGE allows the user to specify both temporal and joint-wise constraints. Constraint joints / frames are highlighted in green and tan, generated joints / frames are in blue and gray. Pictured, top to bottom: dance completion from seed motion, dance that hits a specified keyframe mid-choreography, completion from specified upper-body joint angles, completion from specified lower-body joint angles and root trajectory.
  • Figure 5: We plot $\text{FID}_k$ over the course of model training and find that it is inconsistent with overall quality evaluations.
  • ...and 6 more figures
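
The long-form scheme in the Figure 3 caption can be illustrated with a little bookkeeping. The sketch below is not the authors' implementation: it only works out how many 5-second clips with a 2.5-second overlap are needed to cover a target duration, and how clips that already agree on their overlaps would be joined. The function names `plan_clips` and `stitch`, the 30 fps frame rate, and the stand-in frame data are assumptions made for illustration; how the sampler enforces the overlap during denoising is not shown.

```python
import math


def plan_clips(total_seconds, clip_seconds=5.0, overlap_seconds=2.5):
    """Return the start time (seconds) of each fixed-length clip needed to
    cover total_seconds when consecutive clips share overlap_seconds."""
    stride = clip_seconds - overlap_seconds
    if stride <= 0:
        raise ValueError("overlap must be shorter than the clip length")
    if total_seconds <= clip_seconds:
        return [0.0]
    n_clips = 1 + math.ceil((total_seconds - clip_seconds) / stride)
    return [i * stride for i in range(n_clips)]


def stitch(clips, overlap_frames):
    """Join clips whose first overlap_frames frames repeat the last
    overlap_frames frames of the previous clip."""
    out = list(clips[0])
    for clip in clips[1:]:
        out.extend(clip[overlap_frames:])  # drop the duplicated prefix
    return out


FPS = 30  # assumed frame rate, for illustration only
starts = plan_clips(12.5)            # the 12.5-second example from Figure 3
clip_frames = int(5.0 * FPS)
overlap_frames = int(2.5 * FPS)

# Stand-in "motion": each frame is just its global frame index, so the
# overlapping halves of neighbouring clips agree by construction.
clips = [[int(s * FPS) + k for k in range(clip_frames)] for s in starts]
motion = stitch(clips, overlap_frames)

print(starts)       # start times of the clips in the batch
print(len(motion))  # total number of frames after stitching
```

Under these assumptions, the 12.5-second example in the caption corresponds to four clips in the batch, each contributing 2.5 seconds of new motion after the first.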