GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

TL;DR

This work addresses the challenge of generating music that closely matches diverse video content, aiming for strong cross-modal alignment and diverse, universal output. It introduces GVMGen, a generalized video-to-music generation model that employs hierarchical attentions—spatial self-attention, spatial cross-attention with trainable music queries, and temporal cross-attention—coupled with a MusicGen-based decoder and Encodec for audio synthesis. The authors also propose an evaluation model with global cross-modal relevance and local temporal alignment metrics and curate a large-scale, multi-style video-music dataset, including Chinese traditional music, to promote diversity and realism. Empirical results show that GVMGen outperforms state-of-the-art baselines in music-video correspondence, diversity, and universal applicability, including zero-shot scenarios.

Abstract

Composing music for video is essential yet challenging, which has led to growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, preserving pertinent features while minimizing redundancy. Our method is also versatile: it can generate multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.
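As a rough illustration of the spatial cross-attention with trainable music queries named in the TL;DR, the sketch below lets a small set of query vectors attend over the patch features of one video frame using plain scaled dot-product attention. It is not the authors' implementation: the dimensions, the variable names (`patch_features`, `music_queries`, `frame_summary`), and the random initialization are all assumptions, and the learned query/key/value projections of a real attention layer are omitted for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one coordinate at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


random.seed(0)
dim = 4           # hypothetical feature dimension
num_patches = 6   # hypothetical number of patch features in one frame
num_queries = 2   # hypothetical number of music queries

# Stand-in for per-patch visual features of a single frame.
patch_features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
# In a trained model these would be learnable parameters; here they are random.
music_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

# Each query produces one summary vector of the frame's patches.
frame_summary = cross_attention(music_queries, patch_features, patch_features)
```

Because the number of queries is fixed, the output size per frame stays constant regardless of how many patches the frame contributes, which is one plausible way to read the abstract's goal of keeping pertinent features while reducing redundancy.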

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: General Video-to-Music Generation (GVMGen) model with encoder-decoder structure. The model consists of: (1) Visual feature extraction module with spatial self-attention; (2) Feature transformation module with spatial cross-attention; (3) Conditional music generation module with temporal cross-attention.
  • Figure 2: Evaluation model with both Temporal Alignment (TA) and Cross-Modal Relevance (CMR), where $z_v$ and $z_m$ represent video features and music features.
  • Figure 3: Mel-spectrograms of music generated by different models for the same video input, which is shown in (b). Panel (b) illustrates the pitch contours of the music generated by our model and their alignment with the video input.