Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Guangtao Lyu; Chenghao Xu; Qi Liu; Jiexi Yan; Muli Yang; Fen Fang; Cheng Deng

TL;DR

This work tackles music-to-3D dance generation by arguing that tempo is a stable cue across datasets, unlike genre labels, which are noisy and incomplete. It introduces TempoMoE, a tempo-aware mixture-of-experts module integrated into a diffusion-based motion generator. The module organizes experts into tempo-structured groups and uses a hierarchical routing mechanism to select and fuse experts across tempo bands and beat scales. The model achieves state-of-the-art performance on AIST++, FineDance, and PopDanceSet in motion quality and rhythm alignment, while keeping inference efficient through lightweight features. Extensive ablations and analyses show the benefits of tempo-based specialization, multi-scale beat modeling, and constrained hard/soft routing for robust tempo adaptation. Overall, TempoMoE provides a scalable, label-free approach to rhythm-aware dance generation, with practical implications for real-time animation and content creation.

Abstract

Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
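To make the two-level routing idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It only illustrates one way a hard choice of tempo group could be combined with a soft fusion of beat-scale experts. The band edges in `TEMPO_BANDS`, the function names, and the toy experts built by `make_expert` are all hypothetical. In the actual model, the experts and gate scores would be learned from music features.

```python
import math

# Hypothetical tempo bands covering the 60-200 BPM range cited in the abstract.
# The band edges are an illustrative assumption, not the paper's configuration.
TEMPO_BANDS = [(60, 95), (95, 130), (130, 165), (165, 200)]


def select_tempo_group(bpm):
    """Hard routing: return the index of the band containing `bpm` (clamped)."""
    bpm = min(max(bpm, TEMPO_BANDS[0][0]), TEMPO_BANDS[-1][1])
    for idx, (low, high) in enumerate(TEMPO_BANDS):
        if low <= bpm < high:
            return idx
    return len(TEMPO_BANDS) - 1  # bpm equals the upper edge of the last band


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_beat_experts(expert_outputs, gate_scores):
    """Soft routing: weighted sum of per-expert feature vectors."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for w, out in zip(weights, expert_outputs):
        for i in range(dim):
            fused[i] += w * out[i]
    return fused, weights


def route(bpm, experts_by_group, gate_scores, features):
    """Pick one tempo group by BPM, then softly fuse its beat-scale experts."""
    group = select_tempo_group(bpm)
    outputs = [expert(features) for expert in experts_by_group[group]]
    fused, weights = fuse_beat_experts(outputs, gate_scores)
    return group, fused, weights


def make_expert(scale):
    """Toy stand-in for a learned motion expert: scales its input vector."""
    return lambda x: [scale * v for v in x]


# Four tempo groups, each holding three toy beat-scale experts.
experts_by_group = [
    [make_expert(g + 1 + 0.5 * k) for k in range(3)] for g in range(4)
]
group, fused, weights = route(120, experts_by_group, [0.0, 0.0, 0.0], [1.0, 2.0])
```

In this sketch the hard step restricts computation to one tempo group, so only that group's experts are evaluated. The soft step keeps the fusion within the group differentiable with respect to the gate scores. Whether TempoMoE selects a single group or several neighbouring ones is not stated in the abstract, so the single-group choice here is an assumption.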
