Every Image Listens, Every Image Dances: Music-Driven Image Animation
Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak
TL;DR
MuseDance addresses music-driven image animation with an end-to-end diffusion-based framework that animates a static reference image so that it dances in sync with music while following a text description of the motion. Training proceeds in two stages: appearance pretraining, which teaches the model to preserve the appearance of the reference image, followed by dynamic trigger video generation, which injects music, beat, and motion cues. The work also contributes a new music-dance dataset of 2,904 videos and 454 music tracks. Architecturally, the model uses a ReferenceNet within a latent diffusion architecture, cross-attention conditioning, and three dedicated modules (music understanding, beat alignment, and motion alignment) to produce temporally coherent, semantically controlled animation of both human and non-human subjects. The results demonstrate robust generalization and flexible control, positioning MuseDance as a new baseline for music-guided image animation with practical applications in content creation, entertainment, and education.
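To make the phrase "cross-attention conditioning" concrete, here is a minimal single-head sketch in NumPy. It is illustrative only and is not the MuseDance implementation: the function name `cross_attention`, the tensor sizes, the random projection weights, and the choice to concatenate music and text tokens into one conditioning sequence are all assumptions made for this example. The summary above does not specify how the actual modules are wired.

```python
import numpy as np

def cross_attention(latents, cond, w_q, w_k, w_v):
    """Single-head cross-attention: latent tokens query conditioning tokens.

    Hypothetical helper for illustration; not taken from the paper.
    """
    q = latents @ w_q                      # (n_latent, d)
    k = cond @ w_k                         # (n_cond, d)
    v = cond @ w_v                         # (n_cond, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over cond tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d_latent, d_cond, d = 8, 6, 4              # assumed toy dimensions

latents = rng.normal(size=(16, d_latent))  # stand-in for image latent tokens
music_tokens = rng.normal(size=(5, d_cond))  # stand-in for music features
text_tokens = rng.normal(size=(3, d_cond))   # stand-in for text features

# Assumption: both conditions share one sequence that the latents attend to.
cond = np.concatenate([music_tokens, text_tokens], axis=0)

w_q = rng.normal(size=(d_latent, d))
w_k = rng.normal(size=(d_cond, d))
w_v = rng.normal(size=(d_cond, d))

out, weights = cross_attention(latents, cond, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over the conditioning tokens, so every latent token receives a weighted mix of the music and text features. In a trained model the projection matrices would be learned rather than random.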
Abstract
Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.
