LM2D: Lyrics- and Music-Driven Dance Synthesis

Wenjie Yin, Xuejiao Zhao, Yi Yu, Hang Yin, Danica Kragic, Mårten Björkman

TL;DR

LM2D is a novel probabilistic architecture that combines a multimodal diffusion model with consistency distillation to generate dance conditioned on both music and lyrics in a single diffusion generation step.

Abstract

Dance typically involves professional choreography with complex movements that follow a musical rhythm and can also be influenced by lyrical content. Integrating lyrics alongside the audio enriches the foundational tone and makes motion generation more responsive to the semantic meaning of the song. However, existing dance synthesis methods tend to model motion conditioned only on audio signals. In this work, we make two contributions to bridge this gap. First, we propose LM2D, a novel probabilistic architecture that incorporates a multimodal diffusion model with consistency distillation, designed to create dance conditioned on both music and lyrics in one diffusion generation step. Second, we introduce the first 3D dance-motion dataset that encompasses both music and lyrics, obtained with pose estimation technologies. We evaluate our model against music-only baseline models with objective metrics and human evaluations that include dancers and choreographers. The results demonstrate that LM2D is able to produce realistic and diverse dance matching both lyrics and music. A video summary can be accessed at: https://youtu.be/4XCgvYookvA.

Paper Structure

This paper contains 22 sections, 12 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: LM2D is a multimodal framework that generates realistic and diverse dance movements conditioned on both lyrics and music.
  • Figure 2: Overview of the LM2D framework. LM2D learns to denoise dance sequences from time $t=T$ to $t=0$, conditioned on music and lyrics, in one step via consistency distillation.
  • Figure 3: Overview of consistency models. Given a PF-ODE that smoothly converts real human motion to noisy motion, we learn to map any point on the trajectory to its origin.
  • Figure 4: LM2D Example: Two dance sequences are generated from the same music but with different lyrics.
  • Figure 5: LM2D Example: Two dance sequences are generated from the same lyrics but with different music.
  • ...and 2 more figures
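
The one-step generation described in the captions of Figures 2 and 3 can be illustrated with a minimal NumPy sketch. It is not the LM2D implementation: the network is a hypothetical toy stand-in, and the constants `SIGMA_DATA`, `EPS`, and `T_MAX`, the skip/output scaling, and all function names are assumptions borrowed from the commonly used consistency-model parameterization rather than from this page. The sketch only shows how a consistency function can return its input unchanged at the smallest noise level, and how a sample is then drawn with a single function evaluation from pure noise.

```python
import numpy as np

# Assumed constants (hypothetical values, not taken from the paper).
SIGMA_DATA = 0.5   # assumed scale of clean motion data
EPS = 0.002        # assumed smallest noise level on the trajectory
T_MAX = 80.0       # assumed largest noise level (t = T)


def toy_network(x, t, cond):
    """Hypothetical stand-in for a trained denoiser; NOT the LM2D network.

    x:    noisy motion, shape (n_frames, pose_dim)
    cond: music and lyrics features, shape (n_frames, feat_dim)
    """
    # Collapse the conditioning features to one value per frame so they
    # broadcast against the pose dimensions.
    per_frame = cond.mean(axis=-1, keepdims=True)
    return np.tanh(x + per_frame) / (1.0 + t)


def consistency_fn(x, t, cond):
    """Map a point (x, t) on the noising trajectory toward its origin.

    The skip/output scaling gives c_skip = 1 and c_out = 0 at t = EPS,
    so the function is the identity at the start of the trajectory.
    """
    c_skip = SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * toy_network(x, t, cond)


def sample_one_step(cond, shape, rng):
    """Draw noise at t = T_MAX and map it to a motion sample in one call."""
    x_T = rng.standard_normal(shape) * T_MAX
    return consistency_fn(x_T, T_MAX, cond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cond = rng.standard_normal((8, 4))      # 8 frames, 4 toy features
    motion = sample_one_step(cond, (8, 6), rng)
    print(motion.shape)
```

Because the toy network is untrained, only the boundary condition at `EPS` holds here. Making the output agree across all points of the same trajectory is what consistency distillation is meant to achieve during training.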