V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

TL;DR

Temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. A large crowd-sourced subjective listening test corroborates the objective results.

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-sourced subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
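
To make the idea of an event curve concrete, here is a minimal sketch, not the authors' implementation. It assumes an event curve can be approximated as one minus the cosine similarity between consecutive embeddings from a single encoder, min-max scaled to [0, 1]. The function names, the toy embeddings, and the normalization are illustrative assumptions; the abstract states only that the curves come from intra-modal similarity computed with pretrained music and video encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def event_curve(embeddings):
    """Hypothetical event curve: change between adjacent embeddings.

    Assumption: change = 1 - cosine similarity, min-max scaled to [0, 1].
    The curve depends only on similarity *within* one modality, so the
    embedding dimension and semantics of the encoder do not matter.
    """
    if len(embeddings) < 2:
        return []
    change = [
        1.0 - cosine_similarity(embeddings[t - 1], embeddings[t])
        for t in range(1, len(embeddings))
    ]
    lo, hi = min(change), max(change)
    if hi - lo < 1e-12:
        # A constant signal carries no event information.
        return [0.0] * len(change)
    return [(c - lo) / (hi - lo) for c in change]


# Toy embeddings (made up): both sequences change abruptly between step 1 and 2,
# even though they live in different spaces with different dimensions.
video_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
music_embeddings = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0], [3.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

video_curve = event_curve(video_embeddings)
music_curve = event_curve(music_embeddings)
```

In this toy case the two curves peak at the same step even though the embeddings have different dimensions. That property is what would let a video-event curve stand in for a music-event curve at inference.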

Paper Structure (23 sections, 6 equations, 5 figures, 12 tables, 2 algorithms)

Figures (5)

  • Figure 1: Zero-Pair Video-to-Music Generation. Top: Generating music for video commonly requires large-scale collections of high-quality, paired video-music data. Middle: Our V2M-Zero method is trained only on text–music pairs with an additional music-event curve condition (no video). Bottom: At inference, we swap a music-event curve with aligned video-event curves extracted via off-the-shelf vision models and generate time-synchronized music to match the input video.
  • Figure 2: Shared Temporal Structure Across Modalities. Real event curves computed from video and music exhibit similar temporal patterns across diverse video scenarios. Ground-truth pairs have correlation $\approx$0.6; introducing random offsets degrades this to $\approx$0.2.
  • Figure 3: Method Overview. Top: During training, V2M-Zero learns a rectified-flow diffusion process conditioned on text prompts and a music-event curve derived from intra-music similarity. Bottom: At inference, music conditioning is swapped with a video-event curve based on framewise similarity, enabling zero-pair, time-synchronized video-to-music generation. For semantic alignment, a text prompt is predicted from the video and speech using the music captioner, Vibe, without any joint training.
  • Figure 4: Impact of Smoothing Kernel Size. Larger kernels improve audio quality (FAD*), but temporal alignment (SCH) has an optimal point on OES-Pub.
  • Figure 5: Example event curves with different temporal dynamics. The blue solid curve corresponds to a video with frequent scene cuts, while the orange curve corresponds to a video with slower visual motion, showing distinct temporal structures. This supports our design choice of using event curves to represent relative timing, while text provides complementary semantic guidance.
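
The figure captions mention two operations on event curves: measuring correlation between a video curve and a music curve (Figure 2), and smoothing with a kernel of a given size (Figure 4). The sketch below illustrates both under stated assumptions. It uses Pearson correlation, a box (moving-average) kernel with an odd size, and a circular shift to mimic a temporal offset. None of these choices is confirmed by the captions, and the toy curves are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def moving_average(curve, kernel_size):
    """Box-filter smoothing; windows are truncated at the curve edges.

    Assumption: the smoothing kernel is a simple box filter with odd size.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    smoothed = []
    for i in range(len(curve)):
        window = curve[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def circular_shift(curve, offset):
    """Shift a curve in time by `offset` steps, wrapping around at the end."""
    if not curve:
        return []
    k = offset % len(curve)
    return list(curve) if k == 0 else curve[-k:] + curve[:-k]


# Toy curves (made up): events at the same two time steps in both modalities.
video_curve = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
music_curve = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

aligned_corr = pearson(video_curve, music_curve)
offset_corr = pearson(video_curve, circular_shift(music_curve, 1))
smoothed_video = moving_average(video_curve, 3)
```

In this toy setup the aligned pair correlates perfectly and a one-step offset lowers the correlation. This matches the direction of the effect reported in Figure 2, but not its magnitudes.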