AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
TL;DR
AV-Link is a unified cross-modal diffusion framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages activations from frozen flow-based video and audio generators. A novel Fusion Block enables bidirectional conditioning through time-aligned self-attention and symmetric reinjection, with time-aware RoPE providing robust temporal alignment. Empirical results show substantial improvements in audio-video synchronization and competitive semantic quality, outperforming baselines such as MovieGen on temporal alignment. The approach has a compact 186M-parameter footprint and avoids task-specific pretrained feature extractors, highlighting the effectiveness of diffusion activations as cross-modal conditioning signals.
Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves a substantial improvement in audio-video synchronization, outperforming more expensive baselines such as the MovieGen video-to-audio model.
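As a rough illustration of what temporally-aligned self-attention with time-aware RoPE could look like, the sketch below rotates video and audio tokens by angles derived from their timestamps in seconds rather than from their token indices, then runs a single self-attention over the concatenated sequence. This is a minimal NumPy sketch under assumptions, not the Fusion Block itself: the function names (`token_times`, `apply_rope`, `joint_attention`), the use of token start times, the half-split rotary layout, the frequency base, and the absence of learned projections are all hypothetical choices made for illustration.

```python
import numpy as np

def token_times(num_tokens, duration):
    """Start time in seconds of each token, assuming tokens evenly cover the clip."""
    return np.arange(num_tokens) * duration / num_tokens

def apply_rope(x, timestamps, base=10000.0):
    """Rotate feature pairs of x (n, dim) by angles proportional to timestamps (n,)."""
    n, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature pair
    ang = np.outer(timestamps, freqs)           # (n, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def joint_attention(video, audio, duration):
    """Self-attention over concatenated video and audio tokens (no learned weights)."""
    # Both modalities share one time axis, even though their token rates differ.
    t = np.concatenate([token_times(len(video), duration),
                        token_times(len(audio), duration)])
    x = np.concatenate([video, audio], axis=0)
    q = k = apply_rope(x, t)                    # positions encode time, not index
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True) # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))    # 4 video tokens over a 2-second clip
audio = rng.normal(size=(8, 8))    # 8 audio tokens over the same clip
fused = joint_attention(video, audio, duration=2.0)
```

Because the rotation angle depends only on the timestamp, the attention logit between two tokens depends on their time offset regardless of which modality each came from or how many tokens that modality uses per second.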
