VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

Sifei Li, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, Chen Li

TL;DR

VidMusician tackles the challenge of generating background music that semantically and rhythmically aligns with arbitrary video content by extending a pre-trained text-to-music model with hierarchical visual conditioning. It introduces global semantic conditioning and local rhythmic conditioning, integrated via cross-attention and in-attention, respectively, and employs a two-stage training regime with zero/identity initializations to preserve generative capabilities while learning video correspondence. A new diverse dataset, DVMSet, supports robust evaluation across promo, commercial, and compilation videos, and VidMusician achieves state-of-the-art performance across objective metrics and user studies with only 24.79M trainable parameters. The approach offers a practical, scalable path for AI-assisted video production, enabling high-quality, semantically coherent and rhythmically aligned background music for a wide range of videos.

Abstract

Video-to-music generation presents significant potential in video production, requiring the generated music to be both semantically and rhythmically aligned with the video. Achieving this alignment demands advanced music generation capabilities, sophisticated video understanding, and an efficient mechanism to learn the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach utilizes global visual features as semantic conditions and local visual features as rhythmic cues. These features are integrated into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process, we incrementally incorporate semantic and rhythmic features, utilizing zero initialization and identity initialization to maintain the inherent music-generative capabilities of the backbone. Additionally, we construct a diverse video-music dataset, DVMSet, encompassing various scenarios, such as promo videos, commercials, and compilations. Experiments demonstrate that VidMusician outperforms state-of-the-art methods across multiple evaluation metrics and exhibits robust performance on AI-generated videos. Samples are available at https://youtu.be/EPOSXwtl1jw.
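
The role of zero and identity initialization can be illustrated with a small sketch. It is not the authors' implementation: the helper names `make_linear` and `apply_linear`, the feature size, and the toy vectors are all hypothetical, and the sketch does not claim where each layer sits in the architecture. It only shows the general property that a zero-initialized projection adds nothing to the hidden state at the start of training, and that an identity-initialized projection passes its input through unchanged, so a pre-trained backbone initially behaves as it did before the new layers were attached.

```python
def make_linear(dim, init):
    """Build a square linear layer (weight, bias) with the given initialization.

    Hypothetical helper for illustration; `init` is "zero" or "identity".
    """
    if init == "zero":
        weight = [[0.0 for _ in range(dim)] for _ in range(dim)]
    elif init == "identity":
        weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    else:
        raise ValueError(f"unknown init: {init}")
    bias = [0.0] * dim
    return weight, bias


def apply_linear(layer, x):
    """Compute W @ x + b for a single feature vector x."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy values (assumed): a backbone hidden state and a new conditioning vector.
hidden = [0.5, -1.0, 2.0]
condition = [3.0, 1.0, -2.0]

# Zero init: the projected condition is all zeros, so adding it leaves the
# hidden state untouched before any training step has been taken.
zero_layer = make_linear(3, "zero")
fused = [h + c for h, c in zip(hidden, apply_linear(zero_layer, condition))]

# Identity init: the layer starts out as a pass-through of its input.
identity_layer = make_linear(3, "identity")
passed = apply_linear(identity_layer, hidden)
```

In both cases the new layers start as a no-op with respect to the backbone's existing behavior, and training then moves the weights away from that starting point.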

Paper Structure

This paper contains 26 sections, 6 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Real samples in DVMSet, which covers a wide range of visual scenarios and music types. The first three rows display several video frames, while the last row shows the corresponding Mel-spectrogram segments of the music.
  • Figure 2: We employ an autoregressive model as the music (a) Generative Backbone and propose a method for semantic and rhythmic control via hierarchical visual features. The (b) Semantic Conditioning Module maps the CLIP global features into the text embedding space of the T5 model, which is fine-tuned using LoRA (Hu et al., 2021), and its output is incorporated into the generative backbone via a cross-attention mechanism. The (c) Rhythmic Conditioning Module captures spatial variations by computing inter-frame cosine distances of CLIP local features, and its output is incorporated into the generative backbone via an in-attention mechanism. "$\odot$" represents the cosine similarity calculation, "$1 - sim$" represents subtracting the value of $sim$ from 1 to obtain $dis$, and "$\oplus$" represents element-wise addition. "Zero-Linear" and "Identity-Linear" refer to linear layers with zero and identity initialization, respectively.
  • Figure 3: Inter-frame similarity is assessed using both global and local features, as indicated beneath each set of frames. The first row shows global similarity, while the second row depicts local similarity. In (a), (c), and (d), despite variations in the frames, the global similarity remains close to 1, whereas the local similarity varies with the degree of change. In (b), during transitions, both similarities decrease, but the local similarity exhibits a more pronounced decline. In (d), white boxes highlight the primary areas of change within the video frames.
  • Figure 4: Details of the Transformer block. Each block consists of four Transformer layers. Cross-attention, which incorporates semantic condition, is applied at every layer, whereas in-attention, which integrates rhythm condition, is applied only at the first layer of each block.
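
The inter-frame distance described in the Figure 2 and Figure 3 captions ($dis = 1 - sim$ over local features of adjacent frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: each frame is assumed to be a list of per-patch feature vectors, the function names `cosine_similarity` and `inter_frame_distances` are hypothetical, and the toy features stand in for real CLIP local features. How the resulting distances are further processed before in-attention is not covered here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def inter_frame_distances(frames):
    """For each pair of consecutive frames, return per-patch distances 1 - sim.

    `frames` is a list of frames; each frame is a list of patch feature vectors
    (assumed layout). Patches are compared position by position.
    """
    distances = []
    for prev, curr in zip(frames, frames[1:]):
        distances.append(
            [1.0 - cosine_similarity(p, c) for p, c in zip(prev, curr)]
        )
    return distances


# Toy input (assumed): 3 frames, 2 patches per frame, 2-dimensional features.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],  # second patch changes relative to frame 0
    [[0.0, 1.0], [1.0, 0.0]],  # first patch changes relative to frame 1
]
dist = inter_frame_distances(frames)
```

Because the distance is computed per patch, a change confined to one region shows up as a large value at that patch position while unchanged patches stay near 0, which matches the behavior the Figure 3 caption attributes to local similarity as opposed to global similarity.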