MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
TL;DR
MACE-Dance introduces a cascaded Mixture-of-Experts framework for music-driven dance video generation that optimizes motion realism and appearance fidelity separately. The Motion Expert uses a diffusion model with a BiMamba-Transformer backbone and Guidance-Free Training to convert music into kinematically plausible 3D motion, while the Appearance Expert applies decoupled kinematic-aesthetic fine-tuning on top of Wan-Animate to render high-quality, coherent video from a reference image. A large-scale dataset, MA-Data, and a motion-appearance evaluation protocol provide standardized benchmarking, and the reported results are state-of-the-art on both 3D dance generation and pose-driven image animation. The work indicates that decoupling motion from appearance, combined with specialized architectures and training strategies, improves both motion quality and visual fidelity in music-driven dance video generation.
Abstract
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the few studies that address it still struggle to achieve high-quality visual appearance and realistic human motion at the same time. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving SOTA performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Under this protocol, MACE-Dance also achieves SOTA performance. Project page: https://macedance.github.io/
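
The cascade described above can be sketched as two stages composed in sequence: the Motion Expert maps music to a motion sequence, and the Appearance Expert renders video from that motion and a reference image. The sketch below only illustrates this data flow. Every name in it (`CascadedDancePipeline`, `toy_motion_expert`, `toy_appearance_expert`, the list-based `Motion` and `Video` types) is a hypothetical placeholder, and the toy experts are trivial stand-ins rather than the diffusion or video models used in MACE-Dance.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical placeholder types, not the authors' data formats.
Motion = List[List[float]]  # one pose vector per frame
Video = List[str]           # stand-in for rendered frames


@dataclass
class CascadedDancePipeline:
    """Two experts applied in sequence: music -> motion -> video."""
    motion_expert: Callable[[Sequence[float]], Motion]
    appearance_expert: Callable[[Motion, str], Video]

    def generate(self, music_features: Sequence[float], reference_image: str) -> Video:
        # Stage 1: music-conditioned motion generation.
        motion = self.motion_expert(music_features)
        # Stage 2: motion- and reference-conditioned video synthesis.
        return self.appearance_expert(motion, reference_image)


def toy_motion_expert(music_features: Sequence[float]) -> Motion:
    # Stand-in only: emits one 3-value "pose" per music feature.
    return [[f, 0.0, 0.0] for f in music_features]


def toy_appearance_expert(motion: Motion, reference_image: str) -> Video:
    # Stand-in only: labels one frame per pose with the reference identity.
    return [f"{reference_image}:frame{i}" for i, _ in enumerate(motion)]


pipeline = CascadedDancePipeline(toy_motion_expert, toy_appearance_expert)
video = pipeline.generate([0.1, 0.5, 0.9], "dancer.png")
```

Because the two stages communicate only through the intermediate motion sequence, either expert can in principle be trained, evaluated, or replaced independently, which is the property the motion-appearance decoupling relies on.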
