V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk

TL;DR

V2Meow addresses the challenge of generating high-fidelity music conditioned on arbitrary video input by learning a general video–audio mapping rather than relying on domain-specific cues. It introduces a three-stage, autoregressive pipeline that maps visual features to semantic music tokens and then to coarse-to-fine acoustic tokens, enabling waveform reconstruction conditioned on video frames and optional MuLan text prompts. The model achieves strong audiovisual correspondence and audio quality in zero-shot settings across MV100K, MusicCaps, and AIST++ dance videos, outperforming MIDI-based and text-only baselines. The work demonstrates that decoupling semantic and acoustic modeling and leveraging pre-trained visual and text representations yields robust video-conditioned music generation with controllable style and broad generalization.

Abstract

Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
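The multi-stage design described above can be illustrated with a minimal, runnable sketch. It is not the authors' implementation: the trained autoregressive Transformers are replaced by a deterministic toy function, and every name, token rate, and vocabulary size below is a hypothetical placeholder. Which stage receives which conditioning signal is also an assumption made for illustration. The sketch only shows the data flow: visual features (plus an optional text embedding) produce semantic tokens, which produce coarse acoustic tokens, which produce fine acoustic tokens.

```python
from typing import Dict, List, Optional, Sequence

# Illustrative placeholders, not values taken from the paper.
VOCAB_SIZE = 1024      # size of every token vocabulary in this toy
SEMANTIC_RATE = 25     # semantic tokens per second (assumed)
COARSE_RATE = 50       # coarse acoustic tokens per second (assumed)
FINE_RATE = 100        # fine acoustic tokens per second (assumed)


def toy_autoregressive_stage(conditioning: Sequence[int], n_tokens: int) -> List[int]:
    """Stand-in for one autoregressive stage.

    A real stage would be a trained model predicting token t from the
    conditioning sequence and all tokens before t. Here a deterministic
    mixing function plays that role so the sketch runs without weights.
    """
    tokens: List[int] = []
    state = sum((i + 1) * c for i, c in enumerate(conditioning))
    for t in range(n_tokens):
        # The next token depends on the conditioning and on all earlier tokens.
        state = (state * 1103515245 + 12345 + t + sum(tokens)) % (2 ** 31)
        tokens.append(state % VOCAB_SIZE)
    return tokens


def quantize_features(frames: Sequence[Sequence[float]], levels: int = 256) -> List[int]:
    """Flatten per-frame embeddings (assumed to lie in [0, 1]) into integer ids."""
    return [min(levels - 1, max(0, int(x * levels))) for frame in frames for x in frame]


def generate_music_tokens(
    frames: Sequence[Sequence[float]],
    text_embedding: Optional[Sequence[float]] = None,
    seconds: int = 10,
) -> Dict[str, List[int]]:
    """Run the three toy stages and return the tokens produced by each."""
    visual = quantize_features(frames)
    style = quantize_features([text_embedding]) if text_embedding is not None else []
    conditioning = visual + style

    # Stage 1: visual (and optional text) conditioning -> semantic tokens.
    semantic = toy_autoregressive_stage(conditioning, SEMANTIC_RATE * seconds)
    # Stage 2: semantic tokens plus conditioning -> coarse acoustic tokens.
    coarse = toy_autoregressive_stage(semantic + conditioning, COARSE_RATE * seconds)
    # Stage 3: coarse acoustic tokens -> fine acoustic tokens.
    fine = toy_autoregressive_stage(coarse, FINE_RATE * seconds)
    return {"semantic": semantic, "coarse": coarse, "fine": fine}


if __name__ == "__main__":
    video = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]  # two fake frame embeddings
    out = generate_music_tokens(video, text_embedding=[0.3, 0.7], seconds=2)
    print({name: len(toks) for name, toks in out.items()})
```

In a real system the fine acoustic tokens would then be passed to a neural audio codec decoder to reconstruct the waveform. That step is omitted here because it cannot be sketched without a trained codec.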

Paper Structure

This paper contains 25 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 1: The video-to-music generation model V2Meow synthesizes high-fidelity music conditioned on video input and optionally text describing high-level style.
  • Figure 2: V2Meow Architecture Overview: (left) Feature extraction pipeline for video, audio, and text representations. (right) Overview of multi-stage video-to-music modeling.
  • Figure 3: (a) Ablation study on the contribution of each component on the MV100K dataset, measured by the FAD VGGish score; lower is better. (b) Ablation study on the contribution of each component on the MusicCaps dataset, measured by the MCC score between text and generated audio; higher is better.
  • Figure 4: Videos Included in Human Study: (left) Example video inputs included in the MV100K test set. (right) Example video inputs included in the MusicCaps test set. Two frames are shown for each video.
  • Figure 5: Out-of-Domain Example Analysis: Mel spectrograms of three music clips generated from the same cat video clip, which features an out-of-domain object (a cat) and an audio event (the cat starts to eat at t=7s).
  • ...and 7 more figures
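
The caption of Figure 3(a) reports a FAD score without defining it. As the metric is commonly defined, the Fréchet Audio Distance fits a multivariate Gaussian to the embeddings (here VGGish) of a reference audio set and of the generated audio set, then measures the Fréchet distance between the two Gaussians. The symbols below are notation chosen for this note, not taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the reference embeddings, and $\mu_g, \Sigma_g$ those of the generated embeddings.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The distance is zero when the two Gaussians coincide and grows as the means or covariances diverge, which is why lower values indicate generated audio that is statistically closer to the reference set.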