Dance-to-Music Generation with Encoder-based Textual Inversion

Sifei Li; Weiming Dong; Yuxin Zhang; Fan Tang; Chongyang Ma; Oliver Deussen; Tong-Yee Lee; Changsheng Xu

TL;DR

This work develops dual-path rhythm-genre inversion to effectively integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model, and demonstrates that this approach outperforms state-of-the-art methods across multiple evaluation metrics.

Abstract

The seamless integration of music with dance movements is essential for communicating the artistic intent of a dance piece. This alignment also significantly improves the immersive quality of gaming experiences and animation productions. Although there has been remarkable advancement in creating high-fidelity music from textual descriptions, current methodologies mainly focus on modulating overall characteristics such as genre and emotional tone. They often overlook the nuanced management of temporal rhythm, which is indispensable in crafting music for dance, since it intricately aligns the musical beats with the dancers' movements. Recognizing this gap, we propose an encoder-based textual inversion technique to augment text-to-music models with visual control, facilitating personalized music generation. Specifically, we develop dual-path rhythm-genre inversion to effectively integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model. Contrary to traditional textual inversion methods, which directly update text embeddings to reconstruct a single target object, our approach utilizes separate rhythm and genre encoders to obtain text embeddings for two pseudo-words, adapting to the varying rhythms and genres. We collect a new dataset called In-the-wild Dance Videos (InDV) and demonstrate that our approach outperforms state-of-the-art methods across multiple evaluation metrics. Furthermore, our method is able to adapt to changes in tempo and effectively integrates with the inherent text-guided generation capability of the pre-trained model. Our source code and demo videos are available at https://github.com/lsfhuihuiff/Dance-to-music_Siggraph_Asia_2024.
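The idea of obtaining text embeddings for two pseudo-words from separate encoders can be illustrated with a small sketch. It is not the authors' implementation: the two encoders below are hypothetical stand-ins (simple pooling over toy motion features), the embedding table is invented, and whitespace tokenization is assumed. Only the fixed prompt and the placeholders "@" (genre) and "*" (rhythm) come from the paper's Figure 1 caption.

```python
from typing import Dict, List, Sequence

Vector = List[float]

PROMPT = "a @ music with * as the rhythm"  # fixed training prompt (Figure 1)
GENRE_TOKEN, RHYTHM_TOKEN = "@", "*"


def toy_genre_encoder(motion: Sequence[Sequence[float]], dim: int) -> Vector:
    """Hypothetical stand-in: mean-pool each motion feature over time."""
    n_feat = len(motion[0])
    pooled = [sum(frame[j] for frame in motion) / len(motion) for j in range(n_feat)]
    # Tile the pooled features to the embedding width.
    return [pooled[i % n_feat] for i in range(dim)]


def toy_rhythm_encoder(motion: Sequence[Sequence[float]], dim: int) -> Vector:
    """Hypothetical stand-in: mean absolute frame-to-frame change per feature."""
    n_feat = len(motion[0])
    steps = max(len(motion) - 1, 1)
    change = [
        sum(abs(motion[t + 1][j] - motion[t][j]) for t in range(len(motion) - 1)) / steps
        for j in range(n_feat)
    ]
    return [change[i % n_feat] for i in range(dim)]


def build_prompt_embeddings(
    prompt: str,
    table: Dict[str, Vector],
    genre_vec: Vector,
    rhythm_vec: Vector,
) -> List[Vector]:
    """Look up frozen embeddings for ordinary words and splice in the
    encoder outputs at the two pseudo-word positions."""
    out: List[Vector] = []
    for tok in prompt.split():
        if tok == GENRE_TOKEN:
            out.append(list(genre_vec))
        elif tok == RHYTHM_TOKEN:
            out.append(list(rhythm_vec))
        else:
            out.append(list(table[tok]))
    return out
```

In this sketch the embedding table stays untouched and only the two spliced vectors depend on the input dance, which mirrors the contrast the abstract draws with traditional textual inversion, where a text embedding is optimized directly for a single target.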

Paper Structure (28 sections, 7 equations, 4 figures, 5 tables)

Introduction
Related Work
Audio-video synchronization
Text-to-music generation
Personalization of generative models
Method
Encoder-based Textual Inversion
Rhythm Encoder
Genre Encoder
Experiments
Experimental Setup
Dataset
Implementation details
Evaluation Metrics
Rhythm
...and 13 more sections

Figures (4)

Figure 1: We employ various pre-trained music generative models as the generative backbone and propose an encoder-based textual inversion method. During training, we fix the prompt as "a @ music with * as the rhythm", where "@" and "*" respectively represent the placeholders for the genre and rhythm of the input dance. Our dual-path rhythm-genre inversion optimizes the rhythm encoder and genre encoder together during training. Parameters $v_i$, $v_{i@}$, and $v_{i*}$ correspond to the text embeddings of the prompt, "@", and "*", respectively.
Figure 2: Music visualization examples. "GT" and "Ours" represent "Ground Truth" and "Ours (MUSICGEN)", respectively.
Figure 3: Qualitative examples of beat alignment. The time scale interval is 0.1s. The distribution of beats shows that the majority of the generated beats closely match the ground truth, with offsets below 0.2s. The Riffusion-based model demonstrates slightly better beat alignment than the MUSICGEN-based model, consistent with quantitative metrics.
Figure 4: Visualization of results for videos with changes in tempo. The frequency of audio peaks changes in accordance with the variations in video tempo and aligns with dance movements. The first row illustrates an example of slowed-down tempo, while the second row showcases an example of accelerated tempo.
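
The beat offsets mentioned in the Figure 3 caption can be made concrete with a short sketch. It is an illustration only, not the paper's evaluation metric: the function names are hypothetical, beat times are assumed to be given in seconds, and the 0.2s tolerance is taken from the caption.

```python
from typing import List, Sequence


def nearest_offsets(generated: Sequence[float], reference: Sequence[float]) -> List[float]:
    """For each generated beat, the distance in seconds to the closest reference beat."""
    if not reference:
        raise ValueError("reference beats must not be empty")
    return [min(abs(g - r) for r in reference) for g in generated]


def fraction_within(
    generated: Sequence[float], reference: Sequence[float], tol: float = 0.2
) -> float:
    """Share of generated beats whose nearest-reference offset is below tol."""
    offsets = nearest_offsets(generated, reference)
    if not offsets:
        return 0.0
    return sum(o < tol for o in offsets) / len(offsets)


# Toy beat times in seconds (invented for illustration).
reference_beats = [0.5, 1.0, 1.5, 2.0]
generated_beats = [0.5, 1.1, 1.75, 2.0]
print(fraction_within(generated_beats, reference_beats))
```

A nearest-neighbour match like this does not penalize missing or extra beats, so it is only a rough reading aid for the qualitative plots.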
