Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

TL;DR

MultiFoley introduces a diffusion transformer framework for video-guided Foley generation with multimodal conditioning on text, audio, and video. By training jointly on internet videos and high-quality SFX libraries, using latent diffusion with a DAC-VAE audio encoder, it delivers synchronized, high-fidelity 48 kHz audio and supports flexible control, including text-based editing, audio style transfer, and Foley extension. Quantitative and human studies show better cross-modal alignment and audio quality than existing methods, supporting its use for user-in-the-loop sound design. The approach has practical value for film, game, and media production: it enables expressive, synchronized Foley driven by diverse conditioning signals, with explicit control over output quality.

Abstract

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/

Paper Structure

This paper contains 45 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: MultiFoley for video-guided sound generation with multimodal controls. We generate Foley sounds for silent videos with various control signals to shape their audio. (a) Text prompts, both positive and negative, guide synchronized Foley generation. (b) Reference audio from sound libraries defines the customized audio style. (c) A partial audio track is extended to produce a complete Foley sound. We encourage the reader to watch and listen to the results on our project page (https://ificl.github.io/MultiFoley/).
  • Figure 2: Radar chart comparison for the video-to-audio generation task. Each metric is normalized for better visualization.
  • Figure 3: Method overview. We train our model jointly on a standard audio-video dataset VGGSound for VT2A generation and a high-quality audio-text dataset HQ-SFX for T2A generation. We encode the input audio into latents, adding noise to a portion of them. The silent video is encoded into visual features, concatenated with the audio latents along the channel dimension. The text input, including a quality tag, is encoded through a text encoder and applied via cross-attention.
  • Figure 4: Qualitative examples for Foley generation with text control. We present generated results for two videos, each with three different text prompts, demonstrating our model's ability to produce synchronized soundtracks with varied semantics through text control. Please refer to our project page (https://ificl.github.io/MultiFoley/) for video results.
  • Figure 5: Qualitative results of quality control. We show that VGGSound audio has limited bandwidth and demonstrate our model generates full-band 48kHz audio with quality control.
  • ...and 1 more figure
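
The input construction described in the Figure 3 caption (noise added to a portion of the audio latents, with visual features concatenated along the channel dimension) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `build_model_input`, the tensor shapes, the linear noise interpolation, and the assumption that video features are already resampled to the latent frame rate are all hypothetical. Text conditioning via cross-attention is not shown.

```python
import numpy as np

def build_model_input(audio_latents, video_feats, noise_level, keep_mask, rng):
    """Assemble a per-frame input for a denoiser (illustrative only).

    audio_latents: (T, C_a) clean audio latents.
    video_feats:   (T, C_v) visual features, assumed aligned to the T latent frames.
    noise_level:   float in [0, 1]; the linear mix below is an assumed schedule.
    keep_mask:     (T,) bool, True where latents stay clean (e.g. a known
                   audio prefix that should be extended).
    """
    noise = rng.standard_normal(audio_latents.shape)
    # Assumed noising rule: linear interpolation between clean latents and noise.
    noised = (1.0 - noise_level) * audio_latents + noise_level * noise
    # Only a portion of the latents is noised; masked frames remain clean.
    mixed = np.where(keep_mask[:, None], audio_latents, noised)
    # Concatenate audio latents and visual features along the channel axis.
    return np.concatenate([mixed, video_feats], axis=-1)

rng = np.random.default_rng(0)
T, C_A, C_V = 8, 4, 3                      # hypothetical sizes
audio = rng.standard_normal((T, C_A))
video = rng.standard_normal((T, C_V))
keep = np.arange(T) < 3                    # keep the first 3 frames clean
x = build_model_input(audio, video, 0.7, keep, rng)
print(x.shape)  # (8, 7)
```

Keeping some frames clean while noising the rest is one plausible way to read "adding noise to a portion of them"; it would let the same input layout serve both generation from scratch (no frames kept) and extension of a partial audio track.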