VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Yan-Bo Lin; Yu Tian; Linjie Yang; Gedas Bertasius; Heng Wang

TL;DR

This work develops a generative video-music Transformer with a novel semantic video-music alignment scheme that outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation.

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior datasets used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
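
The joint autoregressive and contrastive objective described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption: the next-token cross-entropy over audio tokens, the symmetric InfoNCE-style contrastive term, the `temperature` value, and the `weight` that balances the two terms are illustrative choices, and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def autoregressive_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence of audio tokens."""
    total = 0.0
    for logits, target in zip(token_logits, target_tokens):
        total -= log_softmax(logits)[target]
    return total / len(target_tokens)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's formula).

    The i-th video embedding is treated as the positive match for the
    i-th music embedding; all other pairs in the batch are negatives.
    """
    n = len(video_embs)
    sims = [[cosine(v, m) / temperature for m in music_embs] for v in video_embs]
    video_to_music = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    music_to_video = -sum(
        log_softmax([sims[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(token_logits, target_tokens, video_embs, music_embs, weight=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical hyperparameter."""
    return autoregressive_loss(token_logits, target_tokens) + weight * contrastive_loss(
        video_embs, music_embs
    )
```

In this sketch the contrastive term is small when each video embedding is closest to its own music embedding and large when the pairing is scrambled, which is the sense in which such a term would encourage high-level video-music alignment.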

Paper Structure (24 sections, 4 equations, 8 figures, 7 tables)

Introduction
Related Work
Text-Conditioned Audio and Music Generation
Video-to-Music Generation
The DISCO-MV Dataset
Technical Approach
Audio and Video Inputs
Efficient Video Encoder
Autoregressive Music Generation
Semantic Video-Music Alignment
Experimental Setup
Downstream Datasets
Evaluation Metrics
Baselines
Results and Analysis
...and 9 more sections

Figures (8)

Figure 1: Our Video-to-Music Generation Framework. Our video-to-music generation framework consists of three main components: 1) an efficient video encoder for capturing fine-grained temporal cues from many densely sampled video frames, 2) an autoregressive music decoder for generating output audio tokens, and 3) a novel video-music alignment scheme that integrates a contrastive training objective and a novel video-beat alignment scheme, ensuring that the generated music exhibits high-level and low-level alignment with the video content.
Figure 2: Video-Beat Alignment Scheme. Our proposed alignment scheme allows us to detect moments in the video where music beats align with low-level visual cues such as dynamic human motions or scene transitions. We use onset detection and optical flow to identify such aligned video-beat moments. This information is then used to supervise our video-music generation model so that it produces music aligned with low-level dynamic visual content.
Figure 3: Human Evaluation. We conduct a human evaluation to compare our VMAs method against several recent video-music generation methods. We present the results as average human preference ratings for 1) the overall music generation quality and 2) the alignment between generated music and the corresponding video content. Each comparison is conducted between a pair of methods. The methods are unknown to the human raters. Our results indicate that human subjects consistently prefer music-video samples with music generated by our method.
Figure 4: Human Evaluation. Human raters are asked to select the generated music that best aligns with a given video and the music with the best quality. We report the average human preference rate for each method. Note that all samples are presented in random order.
Figure 5: Music Genre Distribution. We present the distribution of GTZAN genres for the DISCO-MV dataset. Genres are assigned to each soundtrack based on the maximum cosine similarity between its sound embedding and the corresponding genre (text) embedding.
...and 3 more figures
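
The genre assignment rule in the Figure 5 caption (pick the genre whose text embedding has the highest cosine similarity to the soundtrack's sound embedding) can be sketched as follows. This is a minimal illustration, not the authors' code: the caption does not say which embedding model is used, so the vectors below are made-up toy values, and `assign_genre` and `cosine_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_genre(sound_embedding, genre_text_embeddings):
    """Return the (genre, similarity) pair with the highest cosine similarity.

    `genre_text_embeddings` maps a genre name to its text embedding. Both
    embeddings are assumed to live in a shared audio-text space.
    """
    scored = [
        (genre, cosine_similarity(sound_embedding, text_embedding))
        for genre, text_embedding in genre_text_embeddings.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy 3-dimensional embeddings, purely illustrative.
toy_genres = {
    "rock": [1.0, 0.0, 0.0],
    "jazz": [0.0, 1.0, 0.0],
    "classical": [0.0, 0.0, 1.0],
}
genre, score = assign_genre([0.2, 0.9, 0.1], toy_genres)
```

With these toy vectors the soundtrack is assigned to "jazz", since its embedding points mostly along the jazz axis. Counting the assigned genres over a whole dataset would give a distribution of the kind the figure reports.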
