Music to Dance as Language Translation using Sequence Models

André Correia; Luís A. Alexandre

TL;DR

Evaluation metrics, including Average Joint Error and Fréchet Inception Distance, consistently demonstrate that, when given a piece of music, MDLT excels at producing realistic and high-quality choreography.

Abstract

Synthesising appropriate choreographies from music remains an open problem. We introduce MDLT, a novel approach that frames choreography generation as a translation task. Our method leverages an existing data set to learn to translate sequences of audio into corresponding dance poses. We present two variants of MDLT: one utilising the Transformer architecture and the other employing the Mamba architecture. We train our method on the AIST++ and PhantomDance data sets to teach a robotic arm to dance, but the method can also be applied to a full humanoid robot. Evaluation metrics, including Average Joint Error and Fréchet Inception Distance, consistently demonstrate that, given a piece of music, MDLT produces realistic and high-quality choreography. The code can be found at github.com/meowatthemoon/MDLT.

Paper Structure (19 sections, 6 equations, 2 figures, 3 tables)


Introduction
Related Work
Preliminaries
Transformers
Structured State Space Sequence Models
Translation
Data Preparation
Data sets
Audio Features
Joint Angles
Synchronization
Music to Dance Translation
Transformer
Mamba
Experiments
...and 4 more sections

Figures (2)

Figure 1: Architecture of MDLT model variants. The audio features and poses first pass through their respective embedding layers. In the case of the Transformer variant (dashed left block) these embeddings are augmented with positional encoding. The encoder of the Transformer is conditioned on the audio features. The decoder of the Transformer is conditioned on the output of the encoder as well as the poses. The Mamba variant (dashed right block) receives both the audio and pose vectors. The embeddings of the final Mamba or Decoder block are projected to the pose dimensions. Finally, these values are activated using tanh and scaled to the joint angle range to produce the next poses. Only one of the dashed blocks is used.
Figure 2: Joint angle extraction from keypoints: First we obtain the arm and pelvis keypoints. Then we align the shoulders with the ground plane. Next, we align the spine vertically. Lastly, we extract the joint angles from the angles between the arm vectors.
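
Two steps named in the captions above can be illustrated with a minimal NumPy sketch: squashing the projected outputs with tanh and scaling them to a joint angle range (Figure 1), and reading a joint angle off the angle between two arm vectors (Figure 2). This is not the authors' implementation. The linear mapping from [-1, 1] to the joint limits, the joint limits themselves, the keypoint coordinates, and the function names `scale_to_joint_range` and `angle_between` are all assumptions made for illustration.

```python
import numpy as np


def scale_to_joint_range(raw, low, high):
    """Apply tanh, then map [-1, 1] linearly onto [low, high] (assumed mapping)."""
    squashed = np.tanh(raw)
    return low + (squashed + 1.0) * 0.5 * (high - low)


def angle_between(u, v):
    """Angle in radians between two 3D vectors, e.g. upper arm and forearm."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


# Hypothetical joint limits (radians) for a two-joint arm.
low = np.array([-np.pi, -np.pi / 2])
high = np.array([np.pi, np.pi / 2])

# Hypothetical raw projections from the final block of the model.
raw = np.array([0.0, 100.0])
poses = scale_to_joint_range(raw, low, high)

# Hypothetical keypoints forming a right angle at the elbow.
shoulder = np.array([0.0, 0.0, 0.0])
elbow = np.array([1.0, 0.0, 0.0])
wrist = np.array([1.0, 1.0, 0.0])
elbow_angle = angle_between(elbow - shoulder, wrist - elbow)
```

Bounding the output with tanh means every predicted pose lies within the chosen joint limits by construction, which matters when the poses drive a physical arm.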
