FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang; Yicheng Gu; Yanhong Zeng; Zhening Xing; Yuancheng Wang; Zhizheng Wu; Kai Chen

TL;DR

FoleyCrafter generates high-quality, video-aligned sound effects for silent videos by augmenting a frozen pre-trained text-to-audio (T2A) model with two trainable modules: a semantic adapter that conditions the audio on video content via parallel cross-attention, and a temporal controller that synchronizes audio events to video timing using a timestamp detector and a ControlNet-inspired adapter. The approach preserves the audio quality of the underlying T2A model while achieving strong semantic and temporal alignment, with state-of-the-art metrics and qualitative results reported on the VGGSound and AVSync15 benchmarks. Text prompts give fine-grained control and allow diverse Foley generation conditioned on user intent, and ablation studies validate the importance of both modules. While offering practical benefits for film and media production, the work also acknowledges potential misuse and emphasizes responsible deployment.

Abstract

We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and code are available at https://github.com/open-mmlab/FoleyCrafter.
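The parallel cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the text branch and the visual branch share the same queries and that their outputs are summed with a scalar weight; the function names and the weight `lam` are hypothetical, and the learned key/value projections are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def parallel_cross_attention(queries, text_kv, visual_kv, lam=1.0):
    """Assumed form: text attention plus lam times visual attention.

    text_kv and visual_kv are (keys, values) pairs. In a real model these
    would come from learned projections of text and video embeddings;
    only the visual branch would be trained.
    """
    text_out = attention(queries, *text_kv)
    visual_out = attention(queries, *visual_kv)
    return [
        [t + lam * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_out, visual_out)
    ]
```

Under this assumed form, setting `lam=0` reduces the layer to the original text-only cross-attention, which is one way to see why adding a parallel visual branch need not disturb the frozen text-to-audio model.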

Paper Structure (37 sections, 8 equations, 8 figures, 5 tables)

Introduction
Related Work
Diffusion-based Audio Generation.
Video-to-Audio Generation.
Approach
Preliminaries
Audio Latent Diffusion Model.
Conditioning Mechanisms.
FoleyCrafter
Semantic Adapter
Visual Encoder.
Semantic Adapter.
Temporal Controller
Timestamp Detector.
Timestamp-based Adapter.
...and 22 more sections

Figures (8)

Figure 1: (a) Video-to-audio (V2A) methods struggle with audio quality due to noisy training data, while (b) video-to-text (V2T) methods encounter difficulties in producing synchronized sounds. Our model, FoleyCrafter (FC), integrates a learnable module into a pre-trained text-to-audio (T2A) model to ensure audio quality while enhancing video-audio alignment through audio supervision.
Figure 2: The overview of FoleyCrafter. FoleyCrafter is built upon a pre-trained text-to-audio (T2A) generator, ensuring high-quality audio synthesis. It comprises two main components: the semantic adapter (S.A.) and the temporal controller, which includes a timestamp detector (T.D.) and a temporal adapter (T.A.). Both the semantic adapter and the temporal controller are trainable modules that take videos as input to synthesize audio, with audio supervision for optimization. The T2A model remains fixed to maintain its established capability for high-quality audio synthesis.
Figure 3: The overview of the semantic adapter. The semantic adapter employs a pre-trained visual encoder with several learnable layers to extract video embeddings that align better with the text-to-audio generator. It then integrates trainable visual cross-attention mechanisms alongside the text-based ones, ensuring semantic alignment with the video without compromising text-to-audio generation.
Figure 4: The overview of the temporal controller. The temporal module consists of a timestamp detector and a temporal adapter for improved video-audio synchronization. The timestamp detector predicts sound and silence labels based on the video, optimized using ground truth audio event timestamps. The temporal adapter, initialized from UNet encoder blocks, encodes the timestamp condition and injects synchronization information into the UNet decoder.
Figure 5: Qualitative comparison. As shown in the first case, both SpecVQGAN and Diff-Foley fail to capture the onset of the gunshot sound. In contrast, FoleyCrafter generates the gunshot sound synchronized with the video, showcasing its superior temporal alignment capability.
...and 3 more figures
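
The Figure 4 caption describes a timestamp detector that predicts sound and silence labels from the video, which the temporal adapter then encodes as a condition. The sketch below shows one plausible way to turn per-frame predictions into a time-frequency condition. It is an illustration under stated assumptions, not the paper's procedure: the 0.5 threshold, the nearest-frame stretching, and all function names are hypothetical.

```python
def frames_to_labels(probs, threshold=0.5):
    """Binarize per-frame sound probabilities: 1 = sound, 0 = silence."""
    return [1 if p >= threshold else 0 for p in probs]


def stretch_labels(labels, num_steps):
    """Resample frame labels to num_steps audio time steps.

    Each time step copies the label of the video frame that covers it
    (an assumed, nearest-frame style mapping).
    """
    n = len(labels)
    return [labels[min(n - 1, i * n // num_steps)] for i in range(num_steps)]


def timestamp_mask(probs, num_steps, num_bins, threshold=0.5):
    """Build a num_steps x num_bins binary mask from frame probabilities.

    Every frequency bin of a time step shares that step's label, so the
    mask marks when sound should occur, not what it sounds like.
    """
    steps = stretch_labels(frames_to_labels(probs, threshold), num_steps)
    return [[s] * num_bins for s in steps]
```

For example, `timestamp_mask([0.1, 0.9], num_steps=4, num_bins=3)` marks the first two time steps as silence and the last two as sound across all three frequency bins.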
