Dance2MIDI: Dance-driven multi-instruments music generation

Bo Han; Yuheng Li; Yixuan Shen; Yi Ren; Feilin Han

TL;DR

This work tackles dance-conditioned multi-instrument MIDI generation by building D2MIDI, the first large-scale paired dataset of dance videos and multi-instrument MIDI, and a three-component generation framework, Dance2MIDI. The framework uses a two-branch Context Encoder to capture dance motion and dance style, a Transformer-based Drum Rhythm Generator that uses cross-attention to produce a base rhythm, and a BERT-like Multi-Track MIDI generator that completes the remaining tracks in a self-supervised manner. Experiments on AIST and D2MIDI show state-of-the-art results on objective measures of coherence and quality, as well as superior results on the subjective Consistency metric, supporting the effectiveness of both the dataset and the method for cross-modal symbolic music generation. The approach offers a practical path toward automatically generating dance-aware, multi-instrument soundtracks for diverse dance styles, with implications for creative AI.
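To make the "cross-attention" step of the Drum Rhythm Generator concrete, here is a minimal sketch of scaled dot-product cross-attention in which drum-track token embeddings act as queries and dance-motion features act as keys and values. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the learned query/key/value projections, multiple heads, and causal self-attention that a real Transformer decoder would include are all omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention (hypothetical helper).

    Each query row (a drum-track token embedding) attends over all
    key/value rows (per-frame dance features). No learned projections
    are applied; inputs are used directly as Q, K and V.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    out: Matrix = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Output row = attention-weighted sum of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Toy example: 2 drum-token queries attending over 3 dance frames (feature size 2).
drum_tokens = [[1.0, 0.0], [0.0, 1.0]]
dance_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(cross_attention(drum_tokens, dance_feats, dance_feats))
```

Because the dance features supply the keys and values, every drum token's output is a mixture of motion features weighted by similarity, which is how the dance context can steer the generated rhythm.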

Abstract

Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario remains under-explored. Dance-driven multi-instrument music (MIDI) generation poses two challenges: 1) there is no publicly available paired dataset of multi-instrument MIDI and video, and 2) the correlation between music and video is weak. To tackle these challenges, we build the first multi-instrument MIDI and dance paired dataset (D2MIDI). Based on this dataset, we introduce Dance2MIDI, a multi-instrument MIDI generation framework conditioned on dance video. Specifically, 1) to capture the relationship between dance and music, we employ a Graph Convolutional Network (GCN) to encode the dance motion, which allows us to extract features related to both dance movement and dance style; 2) to generate a harmonious rhythm, we use a Transformer model with a cross-attention mechanism to decode the drum track sequence; and 3) we model the generation of the remaining tracks, given the drum track, as a sequence understanding and completion task, employing a BERT-like model that learns the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
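Point 3 frames track completion as a masked-sequence task. The sketch below shows one way a BERT-style training pair could be built: the drum track stays visible as context, and tokens of every other track are replaced by a mask token with some probability, with labels recording what the model must recover. This is a hypothetical simplification: the paper's actual token representation (see its Music Representation section), masking rate, and any BERT-style 80/10/10 replacement scheme are not given here, so the track names, string tokens, `MASK` symbol, and `mask_non_drum_tracks` function are all assumptions for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask token


def mask_non_drum_tracks(
    tracks: Dict[str, List[str]],
    mask_prob: float,
    seed: int = 0,
    condition_track: str = "drums",
) -> Tuple[Dict[str, List[str]], Dict[str, List[Optional[str]]]]:
    """Build one masked-language-model style training pair (hypothetical).

    The conditioning track is copied unchanged; each token of every other
    track is replaced by MASK with probability mask_prob. Labels hold the
    original token at masked positions and None elsewhere, so a loss would
    only be computed where the model has to fill something in.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    inputs: Dict[str, List[str]] = {}
    labels: Dict[str, List[Optional[str]]] = {}
    for name, tokens in tracks.items():
        if name == condition_track:
            inputs[name] = list(tokens)
            labels[name] = [None] * len(tokens)
            continue
        masked_tokens: List[str] = []
        token_labels: List[Optional[str]] = []
        for tok in tokens:
            if rng.random() < mask_prob:
                masked_tokens.append(MASK)
                token_labels.append(tok)
            else:
                masked_tokens.append(tok)
                token_labels.append(None)
        inputs[name] = masked_tokens
        labels[name] = token_labels
    return inputs, labels


piece = {
    "drums": ["kick", "hat", "snare", "hat"],
    "bass": ["C2", "C2", "G2", "G2"],
    "piano": ["C4", "E4", "G4", "E4"],
}
# mask_prob=1.0 mimics generation time: every non-drum token must be predicted.
inp, lab = mask_non_drum_tracks(piece, mask_prob=1.0)
print(inp["bass"])  # ['[MASK]', '[MASK]', '[MASK]', '[MASK]']
```

During training a partial `mask_prob` teaches the model to use surrounding context from all tracks; at generation time, masking all non-drum tokens turns the same model into a completer conditioned on the drum rhythm.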

Paper Structure (25 sections, 4 equations, 2 figures, 2 tables)

Introduction
Background
Music Generation
Dance To Music
Symbolic Music Dataset
D2MIDI Dataset
Video Crawling and Selection
MIDI Transcription and Annotation
Dance Motion Estimation
Statistics
Dance2MIDI Framework
Context Encoder
Music Representation
Drum Rhythm Generator
Multi-Track MIDI BERTGen
...and 10 more sections

Figures (2)

Figure 1: An overview of our proposed Dance2MIDI model. The dance video is fed into the MediaPipe framework to extract the coordinates of the human body joints. These coordinates are then used to encode the spatio-temporal features of dance movement and dance style (yellow block). These features then serve as conditioning information to guide the generation of multi-instrument MIDI sequences (green block).
Figure 2: Visualization result. For a given dance video, Dance2MIDI generates the corresponding MIDI music, which is then converted into a waveform. Music beats are detected with the public toolbox Librosa. Two dance video clips are shown as examples: the blue box marks the real dance beat (a turning point of the dance motion), and the red box marks the video frame at the timestamp of the detected audio beat.
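The encoding step in Figure 1 treats the joint coordinates extracted by MediaPipe as a graph. The sketch below shows a single spatial graph-convolution layer in the common symmetric-normalized form H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W) applied to one frame of a tiny skeleton. This is a swapped-in textbook formulation, not the paper's exact encoder: the 3-joint chain, identity weights, and helper names are hypothetical, and the temporal modeling needed for spatio-temporal features, the dance-style branch, and the full MediaPipe joint set are omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def normalized_adjacency(num_joints: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    # Start from the identity so every joint keeps its own features (self-loops).
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
        for i in range(num_joints)
    ]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One graph-convolution step: ReLU(A_hat @ H @ W).

    h holds one row of features per joint (here, x and y coordinates for one frame).
    """
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, h), w)]


# Hypothetical 3-joint chain (e.g. shoulder-elbow-wrist) for a single frame.
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, edges)
joints = [[0.5, 0.2], [0.6, 0.4], [0.7, 0.6]]  # (x, y) per joint
w = [[1.0, 0.0], [0.0, 1.0]]                     # identity weights for illustration
print(gcn_layer(a_hat, joints, w))
```

Each output row mixes a joint's coordinates with those of its skeletal neighbors, so connected body parts share information; stacking such layers and adding convolution along the time axis would yield the kind of spatio-temporal motion features the caption describes.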
