MIDGET: Music Conditioned 3D Dance Generation

Jinwu Wang, Wei Mao, Miaomiao Liu

TL;DR

MIDGET presents a music-conditioned approach to 3D dance generation by marrying a Motion VQ-VAE-based memory codebook with a Transformer-powered Motion GPT. It introduces a gradient copying training strategy and Beat Align Loss to directly optimize beat-matching between music and motion, aided by a lightweight Music Feature Extractor. The method demonstrates state-of-the-art motion quality and music-beat alignment on AIST++ while offering more efficient training than diffusion-based baselines. The work contributes a practical, scalable pipeline for producing realistic, long-range 3D dances aligned with musical rhythm.

Abstract

In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, based on a Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant and high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes with music and motion encoders, and 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.
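
The memory codebook mentioned in the abstract can be pictured as a nearest-neighbour lookup: each encoded pose feature is replaced by the index of its closest codebook entry. The sketch below is a generic vector-quantisation illustration in plain Python, not the authors' implementation; the function names, the 2-D toy codebook, and the feature values are all assumptions made for the example.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each feature vector, the index of its nearest codebook entry."""
    codes = []
    for feature in features:
        distances = [squared_distance(feature, entry) for entry in codebook]
        codes.append(distances.index(min(distances)))
    return codes


def decode(codes: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vectors that a sequence of codes refers to."""
    return [list(codebook[c]) for c in codes]


# Hypothetical toy data: three codebook entries, three encoded pose features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

codes = quantize(features, codebook)
quantized = decode(codes, codebook)
print(codes)      # discrete pose codes
print(quantized)  # codebook vectors passed on to a decoder
```

In a setup like this, a sequence model only has to predict the discrete indices in `codes`; a decoder then maps the looked-up vectors back to poses.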

Paper Structure (10 sections, 11 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Dance examples generated by our proposed method. Qualitative human motion generation samples based on our MIDGET model can be found at https://youtube.com/playlist?list=PLFUM19_jtCvR7ThXF6dyQCaGX3hmj416Z.
  • Figure 2: Overview of the MIDGET Model. Given a piece of music and its corresponding dance instructions, MIDGET can generate corresponding high-quality and smooth dance sequences.
  • Figure 3: 3D Dance Motion VQ-VAE. The main purpose of the VQ-VAE model is to obtain codebooks containing diverse quantized dance motion sequences. A learnable encoder and decoder quantize the features and reconstruct the target poses.
  • Figure 4: The structure of Music Feature Extractor. Music features are further extracted through one-dimensional convolutional layers and residual connections.
  • Figure 5: Motion GPT Structure. The GPT model takes the encoded upper and lower body pose codes $a_u, a_l$ and the music features $a_m$ and generates the target future motion probabilities $p^u, p^l$.
  • ...and 1 more figures
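
The Figure 4 caption describes music features being refined by one-dimensional convolutions with residual connections. As a rough illustration of that pattern only, the sketch below applies a single-channel 1D convolution with zero padding, a ReLU, and a skip connection in plain Python. The kernel values, the use of ReLU, and the single-channel layout are assumptions for the example and are not taken from the paper.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D cross-correlation; output length equals input length (odd kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values: Sequence[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Add the input back onto the activated convolution output (skip connection)."""
    branch = relu(conv1d_same(signal, kernel))
    return [x + y for x, y in zip(signal, branch)]


# Hypothetical one-channel music feature over four time steps.
music_feature = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.0, -0.5]  # assumed weights; a trained layer would learn these

out = residual_block(music_feature, kernel)
print(out)
```

Because the convolution preserves the sequence length, the skip connection can add input and output element by element, which is what lets such blocks be stacked.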