CAFA: a Controllable Automatic Foley Artist

Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi

TL;DR

CAFA tackles controllable Foley synthesis by unifying text and video cues to generate semantically and temporally aligned audio. The method couples a pretrained text-to-audio backbone with a trainable modality adapter, conditioned by video representations such as AVCLIP or CLIP, and guided by asymmetric classifier-free guidance to balance text and video conditioning. It demonstrates competitive audio quality and superior text-based controllability, validated through objective metrics and human studies, while achieving substantially lower training costs due to its modular, adapter-based design. Overall, CAFA offers a practical, flexible framework for text-and-video-driven audio generation with strong alignment and creative control.

Abstract

Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video-to-audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While existing methods support this, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, besides its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Motivation. An iconic scene from Jurassic Park, where water in a glass shakes due to the approaching footsteps of a T-Rex. Inferring the generated sound from the video alone is insufficient, as the task is inherently ambiguous. Top: a representative frame and a Y-T slice (from the purple column), where the temporal cue of the shake is faintly visible. Bottom: Our method leverages the prompt "T-Rex Stomping" to generate a synchronized audio track that aligns with both the visual timing and artistic intent.
  • Figure 2: (a) Method overview: our text-and-video-to-audio model leverages pretrained models for audio generation and video encoding. The original audio and the VAE audio encoder are used only during training. (b) Adapter: illustration of the internal connectivity between the backbone T2A model and our video conditioning adapter, with fully connected (FC) layers explicitly shown.
  • Figure 3: Qualitative Comparison of Text-Video Disentanglement. A comparative analysis of various TV2A models: Ground Truth (GT), CAFA (ours), MMAudio, ReWaS, and FoleyCrafter, using the same configurations as in Table \ref{tab:disent}. Our model consistently delivers synchronized, high-quality generations that accurately adhere to the requested target captions, outperforming other approaches. Full videos presented at https://anonymous.4open.science/w/CAFA.
  • Figure 4: Comparison of Asymmetric CFG Scaling Values. CAFA-TF is our adapter applied with TangoFlux (hung2024tangoflux), while the default implementation uses StableAudio-Open (evans2024stable).
  • Figure 5: User study form
  • ...and 1 more figure
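
The TL;DR and Figure 4 refer to asymmetric classifier-free guidance (CFG), in which text and video conditioning receive different guidance scales. The exact formulation is not given in this excerpt, so the following is only a minimal sketch of one common way to combine two conditions with separate scales, not necessarily the authors' method. The function name combine_guidance, its arguments, and the nested two-term form are all assumptions made for illustration; plain Python lists stand in for model outputs.

```python
from typing import List, Sequence


def combine_guidance(
    uncond: Sequence[float],
    video_only: Sequence[float],
    text_and_video: Sequence[float],
    video_scale: float,
    text_scale: float,
) -> List[float]:
    """Blend three model predictions using two separate guidance scales.

    Hypothetical illustration (not taken from the paper):
      uncond          prediction with no conditioning
      video_only      prediction conditioned on video only
      text_and_video  prediction conditioned on both text and video

    The video term pushes from unconditional toward video-conditioned;
    the text term pushes from video-conditioned toward fully conditioned.
    Giving the two terms different scales is what makes it "asymmetric".
    """
    if not (len(uncond) == len(video_only) == len(text_and_video)):
        raise ValueError("all predictions must have the same length")
    return [
        u + video_scale * (v - u) + text_scale * (tv - v)
        for u, v, tv in zip(uncond, video_only, text_and_video)
    ]


if __name__ == "__main__":
    # Toy one-element "predictions" chosen so the arithmetic is easy to follow.
    print(combine_guidance([0.0], [1.0], [3.0], video_scale=2.0, text_scale=0.5))
```

In this sketch, setting both scales to 1 recovers the fully conditioned prediction and setting both to 0 recovers the unconditional one, so raising one scale relative to the other shifts the balance between video timing cues and the text prompt.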