Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Serkan Sulun, Paula Viana, Matthew E. P. Davies
TL;DR
This paper introduces EMSYNC, a two-stage system that automatically generates MIDI-based video soundtracks aligned with a video's emotional content and scene boundaries. It couples a pretrained video emotion classifier with an event-based transformer MIDI generator, which is conditioned on boundary offsets and on continuous valence-arousal values obtained by mapping the classifier's discrete Ekman emotion categories. Temporal boundary conditioning lets the model anticipate scene cuts and place long-duration chords at them, while the emotion-mapping scheme allows emotion information to be passed between different representations. Results on the EmoMV-C and Ads datasets show EMSYNC achieving better objective alignment metrics and stronger subjective preference than state-of-the-art models, highlighting its potential to streamline video production with emotionally and temporally synchronized music. The approach advances practical video-to-MIDI generation by producing editable MIDI outputs and performing well across different video datasets.
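The mapping from discrete emotion categories to continuous valence-arousal values can be illustrated with a minimal Python sketch. It is not the mapping used by EMSYNC: the function name `probs_to_valence_arousal`, the coordinate table `ILLUSTRATIVE_VA`, and every numeric value in it are assumptions chosen only to make the idea concrete. The sketch takes class probabilities from a hypothetical classifier and returns their probability-weighted average in a valence-arousal plane scaled to [-1, 1].

```python
# Illustrative only: these coordinates are placeholder assumptions,
# not the values used by EMSYNC.
ILLUSTRATIVE_VA = {
    "anger":    (-0.6,  0.7),
    "disgust":  (-0.6,  0.3),
    "fear":     (-0.7,  0.6),
    "joy":      ( 0.8,  0.5),
    "sadness":  (-0.7, -0.4),
    "surprise": ( 0.3,  0.8),
}


def probs_to_valence_arousal(probs):
    """Map class probabilities to one (valence, arousal) pair.

    `probs` is a dict from emotion label to a non-negative weight.
    The result is the weighted average of the per-class coordinates.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    valence = sum(p * ILLUSTRATIVE_VA[label][0] for label, p in probs.items()) / total
    arousal = sum(p * ILLUSTRATIVE_VA[label][1] for label, p in probs.items()) / total
    return valence, arousal
```

A weighted average is only one possible design; taking the coordinates of the most probable class would be an equally plausible alternative under the same assumptions.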
Abstract
Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across different representations. Our method outperforms state-of-the-art models in objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music aligned to video both emotionally and temporally. Our demo and output samples are available at https://serkansulun.com/emsync.
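To make the idea of boundary offsets concrete, the following Python sketch computes, for each musical event time, the time remaining until the next scene cut. This is one plausible reading of the mechanism, not the paper's definition: the function name `boundary_offsets`, the use of seconds as the unit, and the `no_boundary` placeholder for events after the last cut are all assumptions made for illustration.

```python
from bisect import bisect_left


def boundary_offsets(event_times, boundaries, no_boundary=None):
    """Return the time from each event to the next scene boundary.

    `event_times` and `boundaries` are in seconds (an assumption).
    An event exactly on a boundary gets offset 0.0; events after the
    last boundary get `no_boundary`.
    """
    cuts = sorted(boundaries)
    offsets = []
    for t in event_times:
        i = bisect_left(cuts, t)  # index of the first cut at or after t
        offsets.append(cuts[i] - t if i < len(cuts) else no_boundary)
    return offsets
```

For example, with scene cuts at 2.5 s and 4.5 s, `boundary_offsets([0.0, 1.0, 2.5, 4.0, 5.0], [2.5, 4.5])` gives a countdown toward each cut that a conditional generator could, in principle, use to anticipate it.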
