Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries

Serkan Sulun, Paula Viana, Matthew E. P. Davies

TL;DR

This paper introduces EMSYNC, a two-stage system that automatically generates MIDI video soundtracks aligned with a video's emotional content and scene boundaries. It combines a pretrained video emotion classifier with an event-based transformer MIDI generator that is conditioned on boundary offsets and on valence-arousal values mapped from the classifier's discrete Ekman emotions. Temporal boundary conditioning enables the model to anticipate scene cuts and place long-duration chords accordingly, while the emotion-mapping scheme bridges the classifier's categorical outputs and the generator's continuous valence-arousal inputs. Results on the EmoMV-C and Ads datasets show EMSYNC outperforming state-of-the-art models in both objective and subjective evaluations, highlighting its potential to streamline video production with emotionally and temporally synchronized music. Because the output is MIDI, the generated soundtracks remain editable.

Abstract

Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across different representations. Our method outperforms state-of-the-art models in objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music aligned to video both emotionally and temporally. Our demo and output samples are available at https://serkansulun.com/emsync.

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Our music generator. Numbers underneath valence and arousal are sample input values for illustration. Input tokens consist of discrete melodic events (e.g., ON, OFF, and TIMESHIFT) and special functional tokens (e.g., START, BAR, and CHORD). Boundary offsets are a sequence of scalar values, illustrated as the heights of the green bars. The plus sign indicates tensor addition. The dashed magenta boxes indicate concatenation. In sequence concatenation, the encoded valence and arousal vectors, each with shape (1, D), and the embedded input token sequence with shape (L, D) are concatenated along the first dimension, resulting in a tensor of shape (L+2, D). In feature concatenation, the encoded boundary offsets and positional embeddings, both with shape (L+2, D/2), are concatenated along the second dimension, producing a tensor of shape (L+2, D).
  • Figure 2: Video-based music generation pipeline. The text and the image next to the arrows demonstrate sample values for illustration.
  • Figure 3: Illustration of boundaries. Top: In symbolic music, a chord with three or more simultaneous notes and a duration exceeding a set threshold defines a musical boundary and is used during training. As CHORD is a special token without duration or pitch, it is omitted from the pianoroll representation. Its temporal location, however, aligns with the red line indicating the boundary. Middle: In video, scene cuts serve as video boundaries and are used during inference. Bottom: Boundary offset represents the temporal distance to the next boundary. Each offset is a scalar value, visualized as the height of the corresponding green bar. These figures are illustrative; offsets do not perfectly align with the music except at the boundary.
  • Figure 4: Sample view of the survey used in the subjective evaluation. Participants were asked to drag the method boxes from the left onto the ranking boxes on the right.
  • Figure 5: Distributions of rankings assigned by the participants in the subjective evaluation using EmoMV-C (top) and Ads (bottom) datasets.
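
The shapes in the Figure 1 caption and the boundary-offset definition in the Figure 3 caption can be traced with a small sketch. This is an illustration under stated assumptions, not the authors' implementation. Plain Python lists stand in for tensors and random numbers stand in for learned embeddings. The helper names, the -1.0 placeholder used when no boundary lies ahead, the zero offsets given to the two conditioning positions, the scalar-times-vector offset encoder, and the ordering inside each concatenation are all hypothetical. Adding the two (L+2, D) tensors is one reading of the plus sign in Figure 1.

```python
from bisect import bisect_left
import random


def boundary_offsets(token_times, boundary_times, no_boundary=-1.0):
    """Distance from each token time to the next boundary at or after it.

    `no_boundary` is an assumed placeholder for tokens with no boundary ahead.
    """
    boundaries = sorted(boundary_times)
    offsets = []
    for t in token_times:
        i = bisect_left(boundaries, t)  # index of the first boundary >= t
        offsets.append(boundaries[i] - t if i < len(boundaries) else no_boundary)
    return offsets


def shape(m):
    """(rows, cols) of a list-of-lists matrix."""
    return (len(m), len(m[0]))


def concat_rows(*mats):
    """Concatenate along the first dimension (sequence concatenation)."""
    return [row for m in mats for row in m]


def concat_cols(a, b):
    """Concatenate along the second dimension (feature concatenation)."""
    return [ra + rb for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise addition of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned embedding or encoding."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


L, D = 6, 8  # sequence length and model width, chosen only for illustration
rng = random.Random(0)

# Token onset times and boundary times in arbitrary time units (made-up values).
token_times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
offsets = boundary_offsets(token_times, boundary_times=[1.0, 2.0])

# Assumption: the two conditioning positions (valence, arousal) get offset 0.0.
padded_offsets = [0.0, 0.0] + offsets  # length L+2

# Stand-in offset encoder: scale a fixed vector by each scalar -> (L+2, D/2).
w = [rng.uniform(-1, 1) for _ in range(D // 2)]
offset_enc = [[o * wi for wi in w] for o in padded_offsets]
pos_emb = rand_matrix(L + 2, D // 2, rng)  # (L+2, D/2)

valence_vec = rand_matrix(1, D, rng)  # (1, D)
arousal_vec = rand_matrix(1, D, rng)  # (1, D)
token_emb = rand_matrix(L, D, rng)    # (L, D)

seq = concat_rows(valence_vec, arousal_vec, token_emb)  # (L+2, D)
feat = concat_cols(offset_enc, pos_emb)                 # (L+2, D)
model_input = add(seq, feat)                            # (L+2, D)

print(offsets)
print(shape(seq), shape(feat), shape(model_input))
```

The offsets shrink to zero at each boundary and restart at the distance to the following one, which is the behavior the green bars in Figure 3 depict. Per the Figure 3 caption, the boundary times would come from long-duration chords during training and from video scene cuts during inference.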