CAFA: a Controllable Automatic Foley Artist
Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi
TL;DR
CAFA tackles controllable Foley synthesis by unifying text and video cues to generate semantically and temporally aligned audio. The method couples a pretrained text-to-audio backbone with a trainable modality adapter, conditioned by video representations such as AVCLIP or CLIP, and guided by asymmetric classifier-free guidance to balance text and video conditioning. It demonstrates competitive audio quality and superior text-based controllability, validated through objective metrics and human studies, while achieving substantially lower training costs due to its modular, adapter-based design. Overall, CAFA offers a practical, flexible framework for text-and-video-driven audio generation with strong alignment and creative control.
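The summary above does not spell out how the asymmetric classifier-free guidance is computed. As an illustration only, the sketch below assumes a generic two-scale composition in which text and video each get their own guidance weight; it is not necessarily the formulation CAFA uses. The function name `dual_condition_cfg`, its arguments, and the scales `w_text` and `w_video` are hypothetical.

```python
from typing import List, Sequence


def dual_condition_cfg(
    eps_uncond: Sequence[float],
    eps_text: Sequence[float],
    eps_text_video: Sequence[float],
    w_text: float,
    w_video: float,
) -> List[float]:
    """Combine three denoiser predictions using separate guidance scales.

    ASSUMPTION: a generic two-scale guidance rule, not taken from the paper.
      eps_uncond     -- prediction with no conditioning
      eps_text       -- prediction conditioned on text only
      eps_text_video -- prediction conditioned on text and video
    """
    return [
        # Start from the unconditional prediction, push toward the text
        # condition by w_text, then toward the video condition by w_video.
        u + w_text * (t - u) + w_video * (tv - t)
        for u, t, tv in zip(eps_uncond, eps_text, eps_text_video)
    ]


# Toy one-dimensional example with made-up predictions.
guided = dual_condition_cfg([0.0], [1.0], [3.0], w_text=2.0, w_video=0.5)
```

Using unequal scales makes the guidance asymmetric: a larger `w_text` relative to `w_video` favors the prompt when it disagrees with what the visuals imply, and the reverse favors the video.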
Abstract
Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advances in automatic video-to-audio models have increased the demand for greater user control over this process. One possible approach is to incorporate text to guide audio generation. Although existing methods support text input, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the contextual cues expected from the video. Experiments show that, in addition to its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.
