SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
Amir Dellali, Luca A. Lanzendörfer, Florian Grötschla, Roger Wattenhofer
TL;DR
SALSA-V generates tightly synchronized, long-form audio from silent video using a masked flow matching framework with audio conditioning and a shortcut loss for fast sampling. The method combines a contrastively trained, high-resolution audiovisual synchronization backbone with masked in- and out-painting to extend audio beyond short clips, and it supports few-step sampling without dedicated fine-tuning. Empirical results, including a human listening study, show state-of-the-art synchronization and competitive audio quality, and demonstrate robust few-step and long-form generation. The work moves video-to-audio synthesis toward practical, near-real-time use, with potential applications in Foley, sound design, and cinematic audio production.
Abstract
We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content, in quantitative evaluation as well as in a human listening study. Furthermore, our use of random masking during training enables our model to match the spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
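
To make the combination of masked, audio-conditioned generation and eight-step sampling more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a straight-line flow from noise to data integrated with eight Euler steps. The function `toy_velocity` is a hypothetical stand-in for a learned network that would be conditioned on time, step size, and video features, and overwriting the known frames after every step is one simple inpainting heuristic chosen for illustration, not necessarily how SALSA-V conditions on reference audio.

```python
import numpy as np


def toy_velocity(x, t, d, target):
    """Hypothetical stand-in for a learned velocity network.

    A shortcut-style model would be conditioned on the step size d as well
    as the time t. This toy ignores d and returns the exact straight-line
    velocity toward a fixed target, which is enough to exercise the sampler.
    """
    return (target - x) / (1.0 - t)


def masked_few_step_sample(reference, known_mask, target, num_steps=8, seed=0):
    """Fill in the unknown frames of `reference` with a few Euler steps.

    reference  : array of audio latents; only entries where known_mask is
                 True are treated as given conditioning audio.
    known_mask : boolean array, True where audio is provided.
    target     : toy "ground truth" that the stand-in velocity points to.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(reference.shape)  # start from Gaussian noise
    d = 1.0 / num_steps                       # uniform step size
    for k in range(num_steps):
        t = k * d
        x = x + d * toy_velocity(x, t, d, target)  # one Euler step
        x = np.where(known_mask, reference, x)     # keep the given frames
    return x


# Out-painting example: the first 4 frames are given, the next 12 are generated.
frames = 16
target = np.linspace(0.0, 1.0, frames)
known_mask = np.zeros(frames, dtype=bool)
known_mask[:4] = True
reference = np.where(known_mask, target, 0.0)

out = masked_few_step_sample(reference, known_mask, target)
```

In this toy setting the given frames are returned unchanged and the remaining frames land on the target after the final step. Long-form generation could in principle repeat the same call chunk by chunk, passing the tail of each generated chunk as the known region of the next one.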
