AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

TL;DR

AV-Link presents a unified cross-modal diffusion framework for V2A and A2V that leverages activations from frozen flow-based video and audio generators. A novel Fusion Block enables bidirectional conditioning with time-aligned self-attention and symmetric reinjection, guided by time-aware RoPE for robust temporal alignment. Empirical results show substantial improvements in audio-video synchronization and competitive semantic quality, outperforming baselines like MovieGen on temporal alignment. The approach uses a compact 186M parameter footprint and avoids task-specific pretrained feature extractors, highlighting the effectiveness of diffusion activations as cross-modal conditioning signals.

Abstract

We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves a substantial improvement in audio-video synchronization, outperforming more expensive baselines such as the MovieGen video-to-audio model.
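
To illustrate what temporally-aligned self-attention can mean in practice, here is a minimal NumPy sketch of rotary position embeddings (RoPE) indexed by timestamps in seconds rather than by token index. It is not the authors' implementation: the function names `temporal_positions` and `rope_rotate`, the token rates (4 Hz video, 16 Hz audio), and the feature sizes are all assumptions chosen for illustration. The point it shows is that when both modalities share one time axis, a video token and an audio token at the same instant receive the same rotation, so their attention score carries no relative positional offset.

```python
import numpy as np

def temporal_positions(num_tokens, rate_hz):
    """Timestamp (seconds) of each token, assuming uniform temporal sampling."""
    return np.arange(num_tokens) / rate_hz

def rope_rotate(x, positions, base=10000.0):
    """Rotate features x of shape (tokens, dim) by angles proportional to `positions`.

    Uses the split-half RoPE layout: channel i is paired with channel i + dim // 2.
    `positions` may be real-valued, which lets tokens sampled at different
    rates share one time axis.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)       # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]    # (tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Hypothetical setup: 2 seconds of content, video tokens at 4 Hz, audio tokens at 16 Hz.
rng = np.random.default_rng(0)
video_q = rng.normal(size=(8, 16))    # stand-in video queries
audio_k = rng.normal(size=(32, 16))   # stand-in audio keys

video_q_rot = rope_rotate(video_q, temporal_positions(8, 4.0))
audio_k_rot = rope_rotate(audio_k, temporal_positions(32, 16.0))

# Video token 1 and audio token 4 both sit at 0.25 s, so they get the same rotation
# and their dot product is unchanged by RoPE (zero relative offset).
aligned_score = video_q_rot[1] @ audio_k_rot[4]
unrotated_score = video_q[1] @ audio_k[4]
```

With plain index-based positions, video token 1 and audio token 4 would instead appear three positions apart even though they describe the same moment.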

Paper Structure

This paper contains 24 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Compared to current Video-to-Audio and Audio-to-Video methods, AV-Link provides a unified framework for these two tasks. Rather than relying on feature extractors pretrained for other tasks (e.g., CLIP, CLAP), we directly leverage the activations from pretrained frozen Flow Matching models using a Fusion Block to achieve precise time alignment between modalities. Our approach offers competitive semantic alignment and improved temporal alignment in a self-contained framework for both modalities.
  • Figure 2: Design of the proposed Fusion Block connecting the frozen video and audio backbones. A RoPE-based temporal alignment mechanism aligns the representations of the two modalities, which are then processed by self-attention. Video and audio features are symmetrically reinjected into the frozen generators. The block is applied at regular intervals throughout the backbones.
  • Figure 3: Visualization of Audio-to-Video and Video-to-Audio generation performance for various values of the flow timestep $t$ used for the conditioning features. Best performance is achieved when the conditioning features are close to fully denoised, i.e., $t \in [0.8, 0.98]$.
  • Figure 4: Qualitative V2A results. Our model achieved the best temporal alignment, matching closely the "bouncing" and "drumming" sounds entailed by the video modality. See the Appendix and Website for additional results.
  • Figure 5: Qualitative results of A2V generation. Our model generates semantically and temporally aligned content capturing temporal events implied by the audio modality such as explosions and printing. See the Website for additional results.
  • ...and 3 more figures
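
The flow timestep $t$ in Figure 3 can be made concrete with a small sketch. It assumes a linear (rectified-flow-style) interpolation path in which $t = 1$ corresponds to the clean sample and $t = 0$ to pure noise; that convention is inferred from the caption's statement that $t \in [0.8, 0.98]$ is "close to fully denoised", and is not confirmed by the text here. The helper `noise_to_flow_time` and the latent shape are hypothetical.

```python
import numpy as np

def noise_to_flow_time(x_clean, t, rng):
    """Return x_t = t * x_clean + (1 - t) * noise (assumed convention: t = 1 is clean)."""
    noise = rng.standard_normal(x_clean.shape)
    return t * x_clean + (1.0 - t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # stand-in for a clean latent of the conditioning modality

# Relative distance from the clean latent at a few flow times.
distances = {}
for t in (0.2, 0.8, 0.98):
    x_t = noise_to_flow_time(x, t, rng)
    distances[t] = np.linalg.norm(x_t - x) / np.linalg.norm(x)
    print(f"t={t:.2f}  relative distance from clean latent: {distances[t]:.3f}")
```

Under this assumption, features taken in the reported range come from an input that is only lightly perturbed, which is consistent with the caption's observation that near-clean conditioning works best.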