VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

TL;DR

VidMuse tackles video-to-music generation using a simple yet effective Long-Short-Term Visual Module to fuse global and local video cues, producing music conditioned solely on visuals. It introduces the V2M dataset (360K training pairs plus finetuning and benchmark subsets) and demonstrates that end-to-end generation with a Music Token Decoder and an Audio Codec yields high-fidelity, semantically aligned music. Across objective metrics and user studies, VidMuse outperforms state-of-the-art baselines and generalizes well to other benchmarks, underscoring its versatility across diverse video genres. The work provides a scalable dataset, a robust architecture, and comprehensive analyses that advance audiovisual generation and retrieval applications.

Abstract

In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets are available at https://vidmuse.github.io/.

Paper Structure

This paper contains 25 sections, 2 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: Dataset Construction. To ensure data quality, V2M goes through rule-based coarse filtering and content-based fine-grained filtering. Music source separation is applied to remove speech and singing signals from the audio. After processing, human experts curate the benchmark subset, while the remaining data is used as the pretraining dataset. The pretraining data is then refined using Audio-Visual Alignment Ranking to select the finetuning dataset.
  • Figure 2: Statistics of our dataset. (a) Distribution of video genres in our dataset; (b) comparison with related datasets in terms of scale. Please zoom in for details.
  • Figure 3: Overview of the VidMuse Framework. This pipeline outlines the key components for generating music aligned with video content: (1) Visual Encoder for extracting visual features, (2) Long-Short-Term Visual Module for integrating local and global cues, (3) Music Token Decoder for generating music tokens, and (4) Audio Codec for the conversion between audio and audio tokens.
  • Figure 4: A/B test results of the user study. We design four criteria (see the user study section) to assess the subjective performance.
  • Figure A1: Distribution of music genres in the dataset, showcasing the diverse representation of genres such as electronic, classical, and jazz.
  • ...and 4 more figures