SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

Amir Dellali, Luca A. Lanzendörfer, Florian Grötschla, Roger Wattenhofer

TL;DR

SALSA-V tackles the challenge of producing tightly synchronized, long-form audio from silent video by introducing a masked flow matching framework with audio conditioning and a shortcut loss for fast sampling. The method combines a contrastively trained, high-resolution audiovisual synchronization backbone with masked in-/out-painting to extend audio beyond short clips, all without fine-tuning for speed. Empirical results show state-of-the-art synchronization and competitive audio quality, including a human listening study, and demonstrate robust few-step and long-form generation capabilities. The work enables practical, near-real-time video-to-audio synthesis with potential applications in Foley, sound design, and cinematic audio production.

Abstract

We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
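The few-step, audio-conditioned generation described above can be illustrated with a toy sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the trained network, the context is held fixed by clamping (one common inpainting strategy, which the paper may or may not use), and all names and values are invented for illustration.

```python
import random


def toy_velocity(x, t, d, target):
    """Hypothetical stand-in for a trained velocity network.

    It points straight from the current state to `target`. A shortcut-style
    model would also be conditioned on the step size `d`; this toy accepts
    `d` only to mirror that interface and does not use it.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_masked(context, keep_mask, target, num_steps=8, seed=0):
    """Euler sampling in `num_steps` steps with context positions held fixed.

    Assumption: positions where `keep_mask` is True carry reference audio
    and are never updated; all other positions start from Gaussian noise.
    """
    rng = random.Random(seed)
    d = 1.0 / num_steps  # uniform step size
    x = [c if keep else rng.gauss(0.0, 1.0)
         for c, keep in zip(context, keep_mask)]
    for k in range(num_steps):
        t = k * d
        v = toy_velocity(x, t, d, target)
        # Update only the masked-out (to-be-generated) positions.
        x = [xi if keep else xi + d * vi
             for xi, vi, keep in zip(x, v, keep_mask)]
    return x


# Outpainting: the first two positions are given, the last two are generated.
context = [0.5, -0.2, 0.0, 0.0]
keep_mask = [True, True, False, False]
target = [0.5, -0.2, 0.3, 0.7]
out = sample_masked(context, keep_mask, target, num_steps=8)
```

In this toy setting the context positions pass through unchanged and the generated positions reach the target after eight Euler steps, because the stand-in velocity field is exact. A learned model would only approximate this behavior.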

Paper Structure

This paper contains 20 sections, 5 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Architectural diagram of SALSA-V. We utilize a mixture of modified MMDiT-X blocks, which operate jointly over the combined sequence of audio and semantic features, and single-stream DiT blocks. The global conditioning signal consists of the text embedding, pooled semantic features, and time-step embedding. The local conditioning tensor is the global conditioning summed with the sequence-aligned synchronization features.
  • Figure 2: Contrastive alignment process. Groups of subsequent video frames are aligned with the corresponding audio snippets. Only patches corresponding to the same time range in the same batch are counted as positives.
  • Figure 3: Qualitative example showing the mel-spectrogram of a generated audio clip with marked impacts, comparing SALSA-V with MMAudio. We observe that SALSA-V predicts event alignment better than previous methods.
  • Figure 4: Qualitative example of the outpainting capabilities of SALSA-V. The video used for this example shows a turkey gobbling, with the corresponding sound. The dashed blue region is given as context. The mel-spectrograms show generations with and without audio conditioning; both use the ground-truth video frames as visual input. With audio conditioning, the model picks up the characteristic sound in the first 2 seconds (the bird's call, highlighted by red boxes) and outpaints beyond that point, outside the blue box. Other spectral features, such as the top end of the frequency range, are also better preserved.
  • Figure 5: FAD and DeSync across different sampling steps (left) and generation lengths (right). SALSA-V maintains generation quality with fewer sampling steps and outperforms previous methods in long-form audio generation.
  • ...and 4 more figures
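
The conditioning scheme described in the Figure 1 caption can be sketched as follows. This is an illustrative sketch rather than the paper's code: the caption says the global signal "consists of" three embeddings without stating how they are combined, so element-wise summation is an assumption here, and the function names and toy values are hypothetical.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def build_conditioning(text_emb, pooled_semantic, timestep_emb, sync_feats):
    """Return (global_cond, local_cond).

    global_cond: one vector shared by the whole sequence.
        ASSUMPTION: the three embeddings are combined by summation.
    local_cond: one vector per sequence position, equal to global_cond
        summed with that position's synchronization feature (as stated in
        the Figure 1 caption).
    """
    global_cond = add(add(text_emb, pooled_semantic), timestep_emb)
    local_cond = [add(global_cond, frame) for frame in sync_feats]
    return global_cond, local_cond


# Toy 2-dimensional embeddings and three sequence positions.
text_emb = [1.0, 0.0]
pooled_semantic = [0.5, 0.5]
timestep_emb = [0.0, 2.0]
sync_feats = [[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]]

global_cond, local_cond = build_conditioning(
    text_emb, pooled_semantic, timestep_emb, sync_feats
)
```

The global vector is broadcast across the sequence, so every position shares the same text, semantic, and time-step context, while the synchronization features add position-specific timing information.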