YingVideo-MV: Music-Driven Multi-Stage Video Generation
Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen
TL;DR
YingVideo-MV tackles music-driven long-video generation by introducing a cascaded pipeline that first performs global shot planning via MV-Director and then generates high-fidelity, lip-synced portrait clips with a temporal-aware diffusion Transformer. It integrates explicit camera control through Plücker-encoded poses via a camera adapter and uses a dynamic window inference strategy to maintain continuity across long sequences. Training leverages Direct Preference Optimization to align outputs with human preferences while preserving diffusion stability. A large Music-in-the-Wild dataset supports diverse, high-quality results. The approach achieves superior audiovisual synchronization, identity consistency, and cinematographic control compared with strong baselines, and demonstrates practical potential for automated music video production.
Abstract
While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset from web data to support diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into the latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embeddings. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .
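The TL;DR mentions that camera poses are Plücker-encoded before reaching the camera adapter. The abstract does not give the exact formulation, so the following is only a minimal sketch of the commonly used per-pixel Plücker ray embedding, assuming a pinhole camera with intrinsics K and a camera-to-world pose (R, t). The function name `plucker_embedding` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker ray embedding (o x d, d), shape (H, W, 6).

    Assumptions (not from the paper): K is a 3x3 pinhole intrinsics
    matrix, R is a 3x3 camera-to-world rotation, and t is the camera
    center in world coordinates.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project pixels to ray directions in the camera frame,
    # then rotate them into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ R.T

    # Normalize so each direction d is a unit vector.
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment vector m = o x d, with the camera center o shared by all rays.
    origin = np.broadcast_to(np.asarray(t, dtype=float), dirs_world.shape)
    moment = np.cross(origin, dirs_world)

    # Six channels per pixel: moment first, then direction.
    return np.concatenate([moment, dirs_world], axis=-1)
```

Under these assumptions, each frame's pose becomes a 6-channel map at the frame's spatial resolution, which is a form that an adapter can combine with video latents. How YingVideo-MV actually resizes and injects this map is not specified in the abstract.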
