VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features
Sifei Li, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, Chen Li
TL;DR
VidMusician tackles the challenge of generating background music that semantically and rhythmically aligns with arbitrary video content by extending a pre-trained text-to-music model with hierarchical visual conditioning. It introduces global semantic conditioning and local rhythmic conditioning, integrated via cross-attention and in-attention, respectively, and employs a two-stage training regime with zero/identity initializations to preserve generative capabilities while learning video correspondence. A new diverse dataset, DVMSet, supports robust evaluation across promo, commercial, and compilation videos, and VidMusician achieves state-of-the-art performance across objective metrics and user studies with only 24.79M trainable parameters. The approach offers a practical, scalable path for AI-assisted video production, enabling high-quality, semantically coherent and rhythmically aligned background music for a wide range of videos.
Abstract
Video-to-music generation presents significant potential in video production, requiring the generated music to be both semantically and rhythmically aligned with the video. Achieving this alignment demands advanced music generation capabilities, sophisticated video understanding, and an efficient mechanism to learn the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach utilizes global visual features as semantic conditions and local visual features as rhythmic cues. These features are integrated into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process, we incrementally incorporate semantic and rhythmic features, utilizing zero initialization and identity initialization to maintain the inherent music-generative capabilities of the backbone. Additionally, we construct a diverse video-music dataset, DVMSet, encompassing various scenarios, such as promo videos, commercials, and compilations. Experiments demonstrate that VidMusician outperforms state-of-the-art methods across multiple evaluation metrics and exhibits robust performance on AI-generated videos. Samples are available at https://youtu.be/EPOSXwtl1jw.
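
The role of zero and identity initialization can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the layer shapes, so the sketch assumes that the semantic branch is a residual cross-attention whose output projection starts at zero, and that the rhythmic branch fuses each hidden state with a time-aligned local feature through a linear layer that starts as [I | 0]. All function names here (`semantic_branch`, `rhythmic_branch`, `identity_fuse`, and so on) are hypothetical. Under these assumptions both branches leave the backbone's hidden state unchanged at the start of training, which is one way the pre-trained music-generation behaviour could be preserved.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(h, cond_tokens):
    """Single-head attention of one hidden state over condition tokens.
    Learned query/key/value projections are omitted for brevity."""
    d = len(h)
    scores = [sum(a * b for a, b in zip(h, c)) / math.sqrt(d) for c in cond_tokens]
    weights = softmax(scores)
    return [sum(w * c[i] for w, c in zip(weights, cond_tokens)) for i in range(d)]

def semantic_branch(h, global_feats, W_out):
    """Assumed form: residual cross-attention over global visual features.
    With W_out initialised to zeros, the update vanishes and h passes through."""
    update = matvec(W_out, cross_attention(h, global_feats))
    return [a + b for a, b in zip(h, update)]

def rhythmic_branch(h, local_feat, W_fuse):
    """Assumed form: in-attention-style fusion that concatenates the hidden
    state with a time-aligned local feature and projects back to len(h)."""
    return matvec(W_fuse, h + local_feat)

def zeros(rows, cols):
    """Zero initialisation for a rows x cols weight matrix."""
    return [[0.0] * cols for _ in range(rows)]

def identity_fuse(d_h, d_r):
    """Identity initialisation [I | 0]: copy h, ignore the local feature."""
    return [[1.0 if j == i else 0.0 for j in range(d_h + d_r)] for i in range(d_h)]

# Toy values (made up for illustration only).
h = [0.5, -1.0, 2.0]                                  # backbone hidden state
global_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # global (semantic) tokens
local_feat = [0.3, 0.7]                               # local (rhythmic) feature

after_semantic = semantic_branch(h, global_feats, zeros(3, 3))
after_rhythmic = rhythmic_branch(after_semantic, local_feat, identity_fuse(3, 2))
```

At initialization both `after_semantic` and `after_rhythmic` equal `h`; once the weights move away from zero and identity during training, the visual conditions begin to influence the hidden state.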
