YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen

TL;DR

YingVideo-MV tackles music-driven long-video generation by introducing a cascaded pipeline that first performs global shot planning via MV-Director and then generates high-fidelity, lip-synced portrait clips with a temporal-aware diffusion Transformer. It integrates explicit camera control through Plücker-encoded poses via a camera adapter and uses a dynamic window inference strategy to maintain continuity across long sequences. Training leverages Direct Preference Optimization to align outputs with human preferences while preserving diffusion stability. A large Music-in-the-Wild dataset supports diverse, high-quality results. The approach achieves superior audiovisual synchronization, identity consistency, and cinematographic control compared with strong baselines, and demonstrates practical potential for automated music video production.

Abstract

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on the audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .
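
As a rough illustration of the Plücker-encoded camera poses mentioned in the TL;DR, the sketch below computes a per-pixel Plücker ray embedding (moment `o x d` plus direction `d`) for a pinhole camera. The function name `plucker_embedding`, the intrinsics matrix `K`, and the camera-to-world matrix `c2w` are assumptions made for this example; the text above does not specify how the camera adapter actually consumes the encoding.

```python
import numpy as np


def plucker_embedding(K, c2w, height, width):
    """Return an (H, W, 6) array of Plücker ray coordinates (o x d, d).

    K      : (3, 3) pinhole intrinsics (assumed convention).
    c2w    : (4, 4) camera-to-world transform (assumed convention).
    """
    # Pixel centres in image coordinates; meshgrid yields (H, W) arrays.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-space rays, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray shares the camera centre as its origin.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([moment, dirs_world], axis=-1)


# Hypothetical 6x4 camera with its centre at (1, 2, 3).
K = np.array([[100.0, 0.0, 3.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
embedding = plucker_embedding(K, c2w, height=4, width=6)
print(embedding.shape)
```

One such six-channel map per frame gives a dense, resolution-aligned description of the camera trajectory, which is one plausible reason this representation is convenient for injecting pose into a latent of matching spatial layout.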

Paper Structure

This paper contains 16 sections, 5 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Conditioned on a portrait image, text, and music input, YingVideo-MV can generate and edit portraits with high identity consistency, expressive facial features, natural body dynamics, and camera movement. The results demonstrate vivid emotions, rich camera movements, and precise lip-syncing, while also generating different artistic styles.
  • Figure 2: Illustration of YingVideo-MV's cascaded generation pipeline. Our framework integrates multimodal inputs (music, text, and images) to enable segmented generation of music-performing portrait videos under the guidance of a global planning module. The planning agent strategically invokes specialized tools according to sub-task requirements, ultimately generating three core outputs conditioned on initial-frame specifications: (1) high-fidelity music-performing portrait images, (2) coherent dynamic camera trajectories, and (3) synchronized audio sequences aligned with visual performance cues.
  • Figure 3: Illustration of Video Generation Model Architecture. Embeddings from the image and text encoders are injected into each block of the DiT. Given music audio input, we leverage Wav2Vec to extract audio embeddings, while the camera trajectory is encoded and incorporated into the diffusion latent. To model the joint audio–latent representation, audio embeddings are fed into an audio adapter, the outputs of which are injected into the DiT via cross-attention.
  • Figure 4: Timestep-aware dynamic window range strategy. From top to bottom, each row represents a denoising process at one timestep. Within each row, each clip of different color represents the segmentation of the long video. There are overlapping areas between each clip. At t=3, the last clip expands its overlap with the preceding clip to satisfy the minimum clip-length constraint. At t=5, the starting offset is reset because the offset accumulated in the previous timestep has reached its maximum allowable value.
  • Figure 5: Visualization of Camera Movement. This figure illustrates our framework's music-driven performance generation with synchronized camera motions. The generated sequences demonstrate precise alignment between body movements and camera motion.
  • ...and 1 more figures
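
The windowing behaviour described in the Figure 4 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names `offset_schedule` and `clip_windows`, and the parameters `clip_len`, `overlap`, `shift`, `max_offset`, and `min_len`, are all hypothetical, and the dependence on the audio embedding mentioned in the abstract is not modeled here.

```python
def offset_schedule(num_steps, shift, max_offset):
    """Starting offset per denoising step; resets to 0 after hitting max_offset."""
    offsets, offset = [], 0
    for _ in range(num_steps):
        offsets.append(offset)
        # Accumulate the shift, or reset once the maximum was reached.
        offset = 0 if offset >= max_offset else min(offset + shift, max_offset)
    return offsets


def clip_windows(num_frames, clip_len, overlap, offset, min_len):
    """Split [0, num_frames) into overlapping clips whose boundaries are shifted by `offset`."""
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    if not 0 <= offset <= clip_len - min_len:
        raise ValueError("offset would shrink the first clip below min_len")
    stride = clip_len - overlap
    windows = []
    start = -offset  # the first clip is truncated at frame 0
    while True:
        lo, hi = max(start, 0), min(start + clip_len, num_frames)
        if hi - lo < min_len:
            # A too-short clip reaches further back, widening its overlap
            # with the preceding clip, to satisfy the minimum length.
            lo = max(hi - min_len, 0)
        windows.append((lo, hi))
        if hi == num_frames:
            break
        start += stride
    return windows


# Hypothetical settings: 15 frames, 8-frame clips, 2-frame overlap.
for step, offset in enumerate(offset_schedule(num_steps=6, shift=2, max_offset=4)):
    print(step, clip_windows(15, clip_len=8, overlap=2, offset=offset, min_len=4))
```

Because the clip boundaries fall at different frames on different denoising steps, no frame sits at a seam on every step, which is the intuition behind shifting windows to improve continuity between clips.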