MIDGET: Music Conditioned 3D Dance Generation
Jinwu Wang, Wei Mao, Miaomiao Liu
TL;DR
MIDGET presents a music-conditioned approach to 3D dance generation by marrying a Motion VQ-VAE-based memory codebook with a Transformer-powered Motion GPT. It introduces a gradient copying training strategy and Beat Align Loss to directly optimize beat-matching between music and motion, aided by a lightweight Music Feature Extractor. The method demonstrates state-of-the-art motion quality and music-beat alignment on AIST++ while offering more efficient training than diffusion-based baselines. The work contributes a practical, scalable pipeline for producing realistic, long-range 3D dances aligned with musical rhythm.
Abstract
In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, based on a Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model, to generate vibrant, high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes with music and motion encoders, and 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.
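To make the idea of a memory codebook of pose codes concrete, the following is a minimal sketch of the generic VQ-VAE quantization step: each encoded motion feature is replaced by its nearest codebook entry, and the entry's index serves as a discrete pose code that a GPT-style model could then predict. This is not the authors' implementation. The function names, the 2-dimensional features, and the three-entry codebook are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to its nearest codebook entry.

    Returns the chosen indices (the discrete "pose codes") and the
    corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feature in features:
        # Nearest-neighbour lookup over all codebook entries.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical toy codebook with three 2-D entries (assumption, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
# Hypothetical encoder outputs for three motion frames.
features = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]

indices, quantized = quantize(features, codebook)
print(indices)
```

In a trained VQ-VAE the nearest-neighbour lookup is not differentiable, so the decoder's gradient is commonly copied straight through to the encoder output; whether MIDGET's gradient copying strategy takes exactly this form is not stated in this section.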
