Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Guangtao Lyu, Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng

TL;DR

This work addresses music-to-3D dance generation, arguing that tempo is a stable cue across datasets, whereas genre labels are noisy and incomplete. It introduces TempoMoE, a tempo-aware mixture-of-experts module integrated into a diffusion-based motion generator, with tempo-structured expert groups and a hierarchical routing mechanism that selects and fuses experts across tempo bands and beat scales. The model achieves state-of-the-art motion quality and rhythm alignment on AIST++, FineDance, and PopDanceSet, while offering efficient inference with lightweight features. Extensive ablations and analyses demonstrate the benefits of tempo-based specialization, multi-scale beat modeling, and constrained hard/soft routing for robust tempo adaptation. Overall, TempoMoE provides a scalable, label-free approach to rhythm-aware dance generation, with practical implications for real-time animation and content creation.

Abstract

Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing mechanism dynamically selects and fuses experts based on music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
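
The two-level routing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `TinyTempoMoE`, the number of tempo groups (4), the number of beat scales (3), the use of plain linear experts, and the choice of hard top-1 group selection followed by softmax fusion are all assumptions made for illustration. The sketch only shows the general pattern of first picking a tempo group from music features and then softly fusing the beat-scale experts inside that group.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


class TinyTempoMoE:
    """Illustrative two-level router (hypothetical, not the paper's code).

    Level 1: hard top-1 choice of a tempo group from the music feature.
    Level 2: soft fusion of the beat-scale experts inside that group.
    """

    def __init__(self, dim: int, num_groups: int = 4, num_scales: int = 3, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Router weights: music feature -> tempo-group logits.
        self.tempo_router = rng.normal(scale=0.1, size=(dim, num_groups))
        # One beat-scale router per tempo group.
        self.beat_router = rng.normal(scale=0.1, size=(num_groups, dim, num_scales))
        # Each expert is a single linear map here (an assumption for brevity).
        self.experts = rng.normal(scale=0.1, size=(num_groups, num_scales, dim, dim))

    def __call__(self, motion_token: np.ndarray, music_feature: np.ndarray):
        # Level 1: hard selection of one tempo group.
        group_logits = music_feature @ self.tempo_router
        group = int(np.argmax(group_logits))
        # Level 2: soft weights over the beat-scale experts of that group.
        scale_weights = softmax(music_feature @ self.beat_router[group])
        expert_outputs = np.stack([motion_token @ w for w in self.experts[group]])
        fused = scale_weights @ expert_outputs  # weighted sum, shape (dim,)
        return fused, group, scale_weights


moe = TinyTempoMoE(dim=8)
rng = np.random.default_rng(1)
out, group, weights = moe(rng.normal(size=8), rng.normal(size=8))
print(out.shape, group, weights.round(3))
```

In this sketch the hard step keeps only one group's experts active per token, while the soft step lets several beat scales contribute to the output. How the paper actually constrains and combines hard and soft routing is not specified in this summary.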

Paper Structure

This paper contains 40 sections, 18 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Visualization of dances under different tempos within the same genre. Even within a single genre, varying BPMs lead to distinct motion patterns: high BPM gives less time per beat, resulting in faster, more localized motions (e.g., quick arm swings, spins), while low BPM allows more time, supporting longer and more complex gestures (e.g., body turns, full-body transitions).
  • Figure 2: (a) BPM distributions across AIST++, FineDance, and PopDanceSet indicate that musical tempos predominantly fall within a shared range of 60–200 BPM, reflecting a common underlying rhythmic structure. (b–d) In contrast, genre distributions are highly imbalanced and dataset-specific: FineDance and AIST++ adopt distinct genre taxonomies, while PopDanceSet provides no explicit genre annotations. These insights motivate us to leverage BPM as a more reliable cue than genre labels.
  • Figure 3: Framework overview of TempoMoE, our tempo-aware dance generation diffusion model with $N$ transformer blocks. In addition to fusing music features via cross-attention, we replace the original FFN with TempoMoE, which adaptively activates tempo-specific expert groups containing multi-scale beat experts to synthesize coherent and rhythm-aligned 3D dance motion.
  • Figure 4: Sample-wise routing for slow (64.09 BPM) and fast (184.75 BPM) samples. Slow tempo engages low-BPM groups and transitions from quarter- to whole-beat experts to capture long-range motions, while fast tempo activates high-BPM groups and relies on quarter-beat experts for rapid, fine-grained movements.
  • Figure 5: Qualitative comparison of TempoMoE and a standard FFN baseline across three music genres. TempoMoE produces more diverse, expressive, and rhythmically coherent motions, highlighting the advantage of tempo-aware expert routing. See supplementary videos for details.
  • ...and 5 more figures
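
The Figure 1 caption's point that a high BPM leaves less time per beat follows from simple arithmetic: one beat lasts 60 / BPM seconds. The short sketch below works this out for the two tempos quoted in the Figure 4 caption. The function names and the 30 fps motion frame rate are assumptions for illustration and are not taken from the paper.

```python
def beat_duration_seconds(bpm: float) -> float:
    """Duration of one beat in seconds at the given tempo."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return 60.0 / bpm


def frames_per_beat(bpm: float, fps: float = 30.0) -> float:
    """Motion frames available per beat; fps=30 is an assumed frame rate."""
    return beat_duration_seconds(bpm) * fps


# The slow and fast tempos quoted in the Figure 4 caption.
for bpm in (64.09, 184.75):
    print(
        f"{bpm:7.2f} BPM: {beat_duration_seconds(bpm):.3f} s per beat, "
        f"{frames_per_beat(bpm):.1f} frames per beat at 30 fps"
    )
```

At roughly 0.94 s per beat for the slow sample versus roughly 0.32 s for the fast one, a slow piece offers about three times as many frames per beat, which is consistent with the longer gestures described for low-BPM music.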