Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

Jaeyong Kang; Soujanya Poria; Dorien Herremans

TL;DR

This work tackles generating music that emotionally and structurally matches video content by introducing Video2Music, which conditions a novel Affective Multimodal Transformer (AMT) on rich video features and audio-derived chord information. It creates MuVi-Sync, a large multimodal dataset linking semantic, motion, scene offset, and emotion cues with chords, keys, loudness, and note density, enabling training of a chord-based video-to-music generator. The framework includes a dedicated affective matching loss and a Bi-GRU post-processing stage that maps predicted video attributes to note density and loudness, producing expressive MIDI through chord arpeggiation and velocity control. Objective metrics and a listening study indicate improved music-video correspondence and chord quality over baselines, with practical implications for automated, copyright-friendly soundtrack generation in multimedia production. The work also releases MuVi-Sync and code, supporting future research in melody extension, waveform-based generation, and advanced chord embeddings.

Abstract

Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that generates music to match a provided video. We first curate a unique collection of music videos. Then, we analyse the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, in a post-processing step, a Bi-GRU-based regression model estimates note density and loudness from the video features. This ensures a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching, is confirmed in a user study. The proposed AMT model, along with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.

Paper Structure (36 sections, 12 equations, 14 figures, 6 tables)

Introduction
Related Work
Transformer-based Music Generation
Music Generation from Videos
Dataset Creation
Music Features
Note Density
Loudness
Chords
Key
Video Features
Semantic features
Emotion
Scene offset
Motion
...and 21 more sections

Figures (14)

Figure 1: Overview of our proposed Video2Music. In the training phase, we extract features from the audio file as well as the video frames and subsequently train the transformer model to predict chord sequences given video. We implemented two losses (chord loss and affective matching loss) to train the model. In the inference phase, the uploaded video and primer chords and key from the user are fed into the trained model to generate chord sequences. In the post-processing phase, we estimate note density and loudness from the input video and use them to synthesize the matching MIDI.
Figure 2: Example of how note density is calculated for each 1s time window. In this example, the estimated note density values are 5, 8, and 4 for the intervals 0s to 1s, 1s to 2s, and 2s to 3s, respectively.
Figure 3: Chord recognition and normalization procedure. Chord sequences, along with their respective start and end times, are identified from the audio file using a chord recognition model. Subsequently, the detected chords are reformatted to a one-chord-per-second representation. Depending on whether the recognized key is major or minor, the song's chords are transposed to either C major or A minor.
Figure 4: Top 30 normalized chords (to either the C major or A minor key) in our dataset.
Figure 5: Top 30 keys in our dataset (before chord normalization).
...and 9 more figures
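
The preprocessing described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: note density is taken to be the number of note onsets in each 1 s window, the chord for a given second is the one active at the start of that second, and chord roots are spelled with sharps only. The function names (note_density, chords_per_second, transpose_root) and the "N" no-chord label are hypothetical.

```python
import math

# Pitch classes spelled with sharps only (an assumption of this sketch;
# flat spellings such as "Bb" would need to be mapped first).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_density(onsets, duration_s):
    """Count note onsets (in seconds) falling in each 1 s window [k, k+1)."""
    counts = [0] * duration_s
    for t in onsets:
        k = math.floor(t)
        if 0 <= k < duration_s:
            counts[k] += 1
    return counts


def chords_per_second(segments, duration_s):
    """Convert (start, end, label) chord segments to one label per second.

    The label chosen for second k is the chord active at time k;
    "N" (hypothetical no-chord label) is used when no segment covers k.
    """
    labels = []
    for k in range(duration_s):
        label = "N"
        for start, end, name in segments:
            if start <= k < end:
                label = name
                break
        labels.append(label)
    return labels


def transpose_root(root, key_tonic, mode):
    """Shift a chord root so a major key maps to C and a minor key to A."""
    target = "C" if mode == "major" else "A"
    shift = (PITCH_CLASSES.index(target) - PITCH_CLASSES.index(key_tonic)) % 12
    return PITCH_CLASSES[(PITCH_CLASSES.index(root) + shift) % 12]
```

For example, a song in G major with a D chord would have that chord mapped to G after normalization, since both are the fifth degree of their respective keys. Only the root is transposed here; the chord quality (major, minor, seventh, and so on) would be carried over unchanged.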
