EDGE: Editable Dance Generation From Music
Jonathan Tseng, Rodrigo Castellon, C. Karen Liu
TL;DR
EDGE is a diffusion-based framework for editable dance generation conditioned on music, supporting long-form synthesis and flexible editing. It combines a transformer decoder with cross-attention and Jukebox-derived music features to produce physically plausible choreographies, and it enables constraint-based in-betweening and joint-wise edits. The authors introduce a physically inspired metric, the Physical Foot Contact score (PFC), and a Contact Consistency Loss to improve the realism of foot-ground contact. Large-scale user studies show that EDGE outperforms prior baselines on multiple criteria while maintaining beat alignment and diversity. The work also demonstrates strong generalization to in-the-wild music and provides a memory-efficient Jukebox feature extraction pipeline, enabling practical editing for animation and interactive applications.
Abstract
Dance is an important human art form, but creating new dances can be difficult and time-consuming. In this work, we introduce Editable Dance GEneration (EDGE), a state-of-the-art method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to the input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning and in-betweening. We introduce a new metric for physical plausibility, and extensively evaluate the quality of dances generated by our method through (1) multiple quantitative metrics on physical plausibility, beat alignment, and diversity benchmarks, and more importantly, (2) a large-scale user study, demonstrating a significant improvement over previous state-of-the-art methods. Qualitative samples from our model can be found at our website.
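The editing capabilities mentioned above (joint-wise conditioning and in-betweening) can be pictured as a masked blend between user-specified motion and model output. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `apply_edit_constraint`, the array shapes, and the toy data are all hypothetical, and a real diffusion sampler would typically apply such a blend at every denoising step with the known motion noised to the matching level, which is omitted here.

```python
import numpy as np


def apply_edit_constraint(generated, known, mask):
    """Blend model output with user constraints (hypothetical helper).

    generated: (frames, features) motion proposed by the model
    known:     (frames, features) user-specified motion
    mask:      (frames, features), 1 where the constraint applies, 0 elsewhere
    """
    generated = np.asarray(generated, dtype=float)
    known = np.asarray(known, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Constrained entries come from `known`; the rest stay as generated.
    return mask * known + (1.0 - mask) * generated


# Toy data: 6 frames, 3 pose features (shapes are illustrative only).
rng = np.random.default_rng(0)
frames, feats = 6, 3
known = np.ones((frames, feats))
generated = rng.standard_normal((frames, feats))

# In-betweening: fix the first and last frames, let the model fill the middle.
inbetween_mask = np.zeros((frames, feats))
inbetween_mask[0] = 1.0
inbetween_mask[-1] = 1.0
inbetween = apply_edit_constraint(generated, known, inbetween_mask)

# Joint-wise conditioning: fix feature 0 in every frame, free the others.
joint_mask = np.zeros((frames, feats))
joint_mask[:, 0] = 1.0
jointwise = apply_edit_constraint(generated, known, joint_mask)
```

The same helper covers both edit types because only the mask changes: a temporal mask gives in-betweening, and a feature-wise mask gives joint-wise conditioning.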
