MV-Crafter: An Intelligent System for Music-guided Video Generation
Chuer Chen, Shengqi Dang, Yuqi Liu, Nanxuan Zhao, Yang Shi, Nan Cao
TL;DR
MV-Crafter helps non-professionals generate high-quality, rhythmically synchronized music videos by integrating three modules: script generation guided by musical semantics (via LLMs and music captions), diffusion-based video generation, and a monotonic synchronization pipeline that combines dynamic beat matching with visual envelope-induced warping. The system segments the input music into clips, generates scene-level scripts with five style keywords, creates per-scene visuals via image-to-video diffusion, and aligns video beats with music beats through a DP-based warping function G(t) that avoids frame repetition. Extensive experiments and user studies show that MV-Crafter outperforms two baselines in beat alignment and content–theme coherence while achieving competitive visual quality and narrative flow, although human-made videos still set the benchmark for narrative coherence and stylistic consistency. The work demonstrates the practical potential of AI-assisted music video authoring for a broader range of creators and outlines limitations, such as motion artifacts, long generation times, and limited scene-level coherence, that guide future improvements in adapters, control features, and faster video synthesis.
Abstract
Music videos, as a prevalent form of multimedia entertainment, deliver engaging audio-visual experiences to audiences and have gained immense popularity among singers and fans. Creators can express their interpretations of music naturally through visual elements. However, creating a music video demands proficiency in script design, video shooting, and music-video synchronization, which poses significant challenges for non-professionals. Previous work has proposed automated music video generation frameworks, but these require complex inputs and produce low-quality output. In response, we present MV-Crafter, a system that produces high-quality music videos whose visual rhythm and style are synchronized with the music. Our approach comprises three technical modules that simulate the human creation process: a script generation module, a video generation module, and a music-video synchronization module. MV-Crafter leverages a large language model to generate scripts that take the musical semantics into account. To address the challenge of synchronizing short video clips with music of varying lengths, we propose a dynamic beat matching algorithm and a visual envelope-induced warping method that together ensure precise, monotonic music-video synchronization. In addition, we design a user-friendly interface with intuitive editing features to simplify the creation process. Extensive experiments demonstrate that MV-Crafter provides an effective solution for improving the quality of generated music videos.
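
The idea of monotonic beat matching followed by time warping can be illustrated with a minimal sketch. It is not the paper's implementation: the cost function (absolute time difference plus a fixed `skip_cost`), the function names `match_beats` and `build_warp`, the piecewise-linear form of G(t), and the choice that G maps output (music) time to source-video time are all assumptions made for illustration. Beat times are taken as given; in particular, extracting visual beats from a visual envelope is not shown.

```python
import numpy as np

def match_beats(visual_beats, music_beats, skip_cost=0.5):
    """Monotonic DP alignment of two sorted beat lists (in seconds).

    Assumed cost: |t_visual - t_music| per matched pair, plus a fixed
    skip_cost for every beat left unmatched on either side.
    Returns (visual_index, music_index) pairs, increasing in both indices.
    """
    n, m = len(visual_beats), len(music_beats)
    cost = np.zeros((n + 1, m + 1))
    cost[:, 0] = skip_cost * np.arange(n + 1)  # all visual beats skipped
    cost[0, :] = skip_cost * np.arange(m + 1)  # all music beats skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = cost[i - 1, j - 1] + abs(visual_beats[i - 1] - music_beats[j - 1])
            cost[i, j] = min(pair,
                             cost[i - 1, j] + skip_cost,
                             cost[i, j - 1] + skip_cost)

    # Backtrack from the end of the table to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pair = cost[i - 1, j - 1] + abs(visual_beats[i - 1] - music_beats[j - 1])
        if np.isclose(cost[i, j], pair):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(cost[i, j], cost[i - 1, j] + skip_cost):
            i -= 1  # visual beat i-1 was skipped
        else:
            j -= 1  # music beat j-1 was skipped
    return pairs[::-1]

def build_warp(pairs, visual_beats, music_beats, src_duration, out_duration):
    """Piecewise-linear G(t): output (music) time -> source-video time.

    Anchors are the matched beats plus both clip endpoints. Because the
    anchors increase strictly in both coordinates (assuming all beats lie
    strictly inside the clips), G is strictly increasing, so no source
    time is visited twice.
    """
    xs = [0.0] + [music_beats[j] for _, j in pairs] + [out_duration]
    ys = [0.0] + [visual_beats[i] for i, _ in pairs] + [src_duration]
    return lambda t: np.interp(t, xs, ys)

# Hypothetical beat times, for illustration only.
visual_beats = [0.4, 1.1, 1.9]
music_beats = [0.5, 1.0, 1.5, 2.0]
pairs = match_beats(visual_beats, music_beats)
G = build_warp(pairs, visual_beats, music_beats, src_duration=2.4, out_duration=2.5)
warped = G(np.linspace(0.0, 2.5, 26))  # source time to sample for each output frame
```

In this toy example the music beat at 1.5 s is left unmatched, and the visual beat at 1.9 s is moved onto the music beat at 2.0 s. Sampling the source video at `G(t)` for each output frame time `t` then stretches or compresses the footage between beats without running time backwards.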
