Every Image Listens, Every Image Dances: Music-Driven Image Animation

Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak

TL;DR

MuseDance tackles music-driven image animation with an end-to-end diffusion-based framework that animates a static reference image to dance in sync with music while following a text-guided motion description. The method uses a two-stage training pipeline: appearance pretraining, which learns spatial appearance by generating individual frames, followed by dynamic trigger video generation, which injects music, beat, and motion cues. It is paired with a new music-dance dataset of 2,904 videos and 454 music tracks. The model builds on a ReferenceNet within a latent diffusion architecture, cross-attention conditioning, and three dedicated modules (music understanding, beat alignment, motion alignment) to achieve temporally coherent, semantically controlled animation of both human and non-human objects. The results demonstrate robust generalization and flexible control, presenting MuseDance as a new baseline for music-guided image animation with practical applications in content creation, entertainment, and education.

Abstract

Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new baseline for the music-driven image animation task.

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: MuseDance generates a dancing video from a reference image, synchronizing movements to the provided music, aligning with the beats, and visually interpreting the guidance of a text prompt for a seamless, music-driven animation.
  • Figure 2: In the first training stage, we train the model to capture spatial information by generating individual frames, with reference and target frames randomly sampled from a short time window. DensePose is used to help the model focus on the object, while text prompts assist in understanding motion. In the second training stage, we freeze the spatial attention blocks to preserve the model’s frame generation ability and introduce music, beat, and motion modules to incorporate music dynamics, align with the beat, and improve frame-to-frame consistency.
  • Figure 3: An example of textual data generation: we provide a series of frames and a detailed prompt to instruct GPT-4o to generate motion captions.
  • Figure 4: Music-driven dance video generation on non-human objects.
  • Figure 5: Dance video generations with the same text prompt but different reference images and music inputs. Frames are shown at the same time points from both the generated videos and the ground truth.
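
The stage-one sampling described in the Figure 2 caption (reference and target frames drawn at random from a short time window) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_frame_pair`, the `window` size of 30 frames, and the choice to force two distinct frames are all assumptions made for illustration, since the caption does not specify them.

```python
import random


def sample_frame_pair(num_frames, window=30, rng=None):
    """Pick (reference_idx, target_idx) from one short time window of a video.

    `window` is a hypothetical size in frames; the source does not state it.
    Forcing the two indices to differ is also an assumption of this sketch.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    rng = rng or random.Random()
    # Clamp the window so it never exceeds the length of the video.
    span = min(window, num_frames)
    # Choose where the window starts, then draw two distinct frames inside it.
    start = rng.randrange(num_frames - span + 1)
    ref_idx, tgt_idx = rng.sample(range(start, start + span), 2)
    return ref_idx, tgt_idx


if __name__ == "__main__":
    ref, tgt = sample_frame_pair(num_frames=120, window=30, rng=random.Random(0))
    print(f"reference frame: {ref}, target frame: {tgt}")
```

Keeping both frames inside one window means the pair shares subject and background but differs in pose, which matches the caption's stated goal of training single-frame generation before any temporal modules are introduced.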