Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Shulei Ji, Zihao Wang, Jiaxing Yu, Xiangyuan Yang, Shuyu Li, Songruoyao Wu, Kejun Zhang

TL;DR

Diff-V2M tackles video-to-music (V2M) generation by explicitly modeling rhythm and fusing multi-view visual cues in a hierarchical conditional diffusion framework. It evaluates three rhythmic representations (low-resolution mel-spectrograms, tempograms, and onset detection functions) and introduces a rhythm predictor that infers them directly from video. Emotional, semantic, and rhythmic features are integrated via hierarchical cross-attention with timestep-aware fusion. Built on a DiT-based audio latent diffusion model, the approach reports state-of-the-art objective and subjective performance on in-domain and out-of-domain data, with efficient inference relative to several baselines. The work highlights the importance of rhythm conditioning for audiovisual alignment and describes training strategies intended to bridge the gap between training and inference.

Abstract

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignment; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythm predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons. Demo and code are available at https://Tayjsl97.github.io/Diff-V2M-Demo/.
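
To make the notion of a low-resolution onset detection function concrete, here is a minimal NumPy sketch. It is not the authors' pipeline: it assumes a generic spectral-flux ODF and max-pooling as the downsampling step, and the function names, frame sizes, and pooling factor are all hypothetical choices made for illustration.

```python
import numpy as np

def spectral_flux_odf(signal, frame_len=1024, hop=512):
    """Generic spectral-flux ODF (an assumption, not the paper's exact recipe).

    Returns one non-negative value per analysis frame, peak-normalised to [0, 1].
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrogram
    rise = np.maximum(np.diff(mag, axis=0), 0.0)   # keep only energy increases
    odf = np.concatenate([[0.0], rise.sum(axis=1)])
    peak = odf.max()
    return odf / peak if peak > 0 else odf

def to_low_resolution(odf, factor):
    """Max-pool the ODF over non-overlapping groups of `factor` frames."""
    n = (len(odf) // factor) * factor
    return odf[:n].reshape(-1, factor).max(axis=1)

# Toy input: 2 s of silence at 8 kHz with two short noise bursts ("onsets").
rng = np.random.default_rng(0)
signal = np.zeros(16000)
signal[4000:4200] = rng.standard_normal(200)
signal[12000:12200] = rng.standard_normal(200)

odf = spectral_flux_odf(signal)          # 30 frames
low = to_low_resolution(odf, factor=5)   # 6 coarse bins
print(np.round(low, 3))
```

Only the two coarse bins that contain a burst are non-zero, which illustrates why a coarse ODF can still carry the onset timing that a video-conditioned predictor would need to estimate.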

Paper Structure

This paper contains 40 sections, 11 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The architecture of Diff-V2M, consisting of two core modules: (a) visual feature extraction, which derives emotional, semantic, and rhythmic features; and (b) conditional music generation built on a DiT-based LDM, which integrates multi-view features via hierarchical cross-attention and timestep-aware fusion strategies.
  • Figure 2: An example illustrating explicit video scene transitions, the visual rhythm curve, and the visual beats.
  • Figure 3: Illustration of the different fusion strategies for semantic and rhythmic features.
  • Figure 4: Comparison of inference time across methods when generating soundtracks for 30-second videos.
  • Figure 5: A/B test results from the subjective comparisons.
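
The two timestep-aware fusion strategies named in the abstract (weighted fusion and FiLM) can be sketched as follows. This is an illustrative guess at their general shape, not the paper's formulation: the sinusoidal timestep embedding, the scalar gate, the choice to modulate the rhythmic stream, and every parameter name below are assumptions.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of a diffusion timestep (assumed here)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_fusion(semantic, rhythmic, t_emb, w_vec, bias):
    """Convex mix of the two streams with a timestep-dependent scalar gate."""
    alpha = sigmoid(t_emb @ w_vec + bias)          # scalar in (0, 1)
    return alpha * semantic + (1.0 - alpha) * rhythmic, alpha

def film_fusion(semantic, rhythmic, t_emb, W_gamma, W_beta):
    """FiLM-style fusion: a per-channel scale and shift, both predicted from
    the timestep, modulate the rhythmic stream before it is added to the
    semantic stream. Which stream is modulated is an assumption."""
    gamma = t_emb @ W_gamma                        # shape (feat_dim,)
    beta = t_emb @ W_beta                          # shape (feat_dim,)
    return semantic + gamma * rhythmic + beta

# Toy features: 8 time steps, 4 channels; random (untrained) parameters.
rng = np.random.default_rng(0)
semantic = rng.standard_normal((8, 4))
rhythmic = rng.standard_normal((8, 4))
t_emb = timestep_embedding(t=500, dim=16)

w_vec = 0.1 * rng.standard_normal(16)
fused_w, alpha = weighted_fusion(semantic, rhythmic, t_emb, w_vec, bias=0.0)

W_gamma = 0.1 * rng.standard_normal((16, 4))
W_beta = 0.1 * rng.standard_normal((16, 4))
fused_f = film_fusion(semantic, rhythmic, t_emb, W_gamma, W_beta)
print(fused_w.shape, fused_f.shape)
```

In both variants the mixing parameters depend only on the timestep embedding, so a trained model could in principle shift its reliance between semantic and rhythmic cues as denoising progresses. In a real system the projection weights would be learned rather than sampled.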