Video-based Music Generation

Serkan Sulun

TL;DR

This work introduces EMSYNC, a fully automatic video-based music generation framework that synchronizes its musical output with the emotion and pacing of the input video. It combines a novel video emotion classifier, a MIDI generator conditioned on continuous emotion values, and a boundary-aware temporal conditioning mechanism that aligns chords with scene changes. A large-scale, emotion-labeled MIDI dataset is constructed from Spotify-derived features and lyrics, enabling continuous valence-arousal conditioning and multi-instrument generation. The work also explores audio bandwidth extension to study robustness to synthetic data, revealing key generalization challenges and proposing data augmentation as a mitigation. Across objective tests and user studies, EMSYNC outperforms prior methods in musical richness, emotional alignment, and temporal synchronization, establishing a new state of the art in video-based music generation and offering open-source resources for the community.

Abstract

As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By using pretrained deep neural networks for feature extraction, keeping them frozen, and training only the fusion layers, we reduce computational complexity while improving accuracy. We demonstrate the generalization ability of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for affective music generation. We then present an emotion-based MIDI generator, the first to condition on continuous emotional values rather than discrete categories, enabling nuanced music generation aligned with complex emotional content. To enhance temporal synchronization, we introduce a novel temporal boundary conditioning method, called "boundary offset encodings," which aligns musical chords with scene changes. By combining video emotion classification, emotion-based music generation, and temporal boundary conditioning, EMSYNC becomes a fully automatic video-based music generator. User studies show that it consistently outperforms existing methods in terms of music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state of the art in video-based music generation.
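
The abstract does not spell out how "boundary offset encodings" are computed. As a rough illustration of the general idea only, the sketch below assumes that each music generation step is given the time remaining until the next scene change, so that a generator could learn to place a chord when that offset reaches zero. The function name `boundary_offsets`, the use of seconds as the unit, and the use of `None` after the last boundary are assumptions made for this example, not details taken from the thesis.

```python
from bisect import bisect_left


def boundary_offsets(step_times, boundaries):
    """Return, for each step time, the time left until the next boundary.

    Hypothetical helper: an illustrative reading of "boundary offset",
    not the thesis's actual encoding.
    step_times: times (in seconds) at which music tokens are generated.
    boundaries: scene-change times (in seconds), in any order.
    A step that falls exactly on a boundary gets offset 0.0.
    Steps after the last boundary get None.
    """
    ordered = sorted(boundaries)
    offsets = []
    for t in step_times:
        # Index of the first boundary at or after time t.
        i = bisect_left(ordered, t)
        if i == len(ordered):
            offsets.append(None)  # no scene change remains
        else:
            offsets.append(ordered[i] - t)
    return offsets


if __name__ == "__main__":
    # Scene changes at 2.0 s and 3.5 s, one generation step per second.
    print(boundary_offsets([0.0, 1.0, 2.0, 3.0, 4.0], [2.0, 3.5]))
```

In a real model, such offsets would typically be quantized or embedded before being fed to the generator; that step is omitted here because the abstract gives no details about it.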
