MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao

TL;DR

This paper introduces a contrastive music-visual pre-training scheme that exploits the periodic nature of musical phrases to ensure synchronization, and demonstrates that its flow-matching-based music generator has in-context learning ability, which allows the style and genre of the generated music to be controlled.

Abstract

Generating music that aligns with the visual content of a video is a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narrative. This paper presents MuVi, a novel framework that addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme that ensures synchronization by exploiting the periodic nature of musical phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, allowing us to control the style and genre of the generated music. Experimental results show that MuVi achieves superior performance in both audio quality and temporal synchronization. The generated music video samples are available at https://muvi-v2m.github.io.

Paper Structure

This paper contains 44 sections, 3 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The architecture of MuVi. The main model and the input/output are illustrated in the middle, where the visual encoder is frozen during the training stage. The visual compression strategies are listed on the left, where "CLS" indicates the CLS token of certain visual encoders, such as CLIP. The architecture of the diffusion Transformer is illustrated on the right.
  • Figure 2: Visualization of the attention distribution of Softmax aggregation. Each video frame is overlaid with a mask built from the averaged attention scores: the weight of each patch after applying Softmax sets the color of its mask, which appears bluer as the weight approaches 0.0 and yellower as it approaches 1.0. The yellower a patch, the more strongly it is related to the generated music. This reflects the attention distribution of the adaptor.
  • Figure 3: Illustration of In-context Learning.
  • Figure 4: Results of different CFG scales.
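
The Softmax aggregation visualized in Figure 2 can be illustrated with a minimal sketch. It assumes that each frame is split into patches, that each patch has a feature vector and a scalar score, and that the frame-level feature is the Softmax-weighted sum of the patch features. The function names, the toy features, and the hand-picked scores are hypothetical; in MuVi the weights come from the learned visual adaptor, whose exact form is not given here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_aggregate(patch_features, patch_scores):
    """Pool the per-patch features of one frame into a single vector.

    patch_features: list of equal-length feature vectors, one per patch.
    patch_scores:   one scalar score per patch (hypothetical stand-in
                    for whatever the adaptor computes).
    Returns the pooled vector and the per-patch weights.
    """
    weights = softmax(patch_scores)
    dim = len(patch_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, patch_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy frame with four patches and two-dimensional features (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = [0.0, 0.0, 2.0, -1.0]

pooled, weights = softmax_aggregate(features, scores)
print("weights:", [round(w, 3) for w in weights])
print("pooled feature:", [round(p, 3) for p in pooled])
```

The returned weights are the quantities a figure like Figure 2 would color: a patch with a larger weight contributes more to the pooled feature and would be drawn yellower.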