Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

Yaoyun Zhang, Xuenan Xu, Mengyue Wu

TL;DR

Smooth-Foley targets continuous, semantically accurate video-to-audio generation. It adds two lightweight adapters, a frame adapter and a temporal adapter, to a pre-trained diffusion-based text-to-audio (T2A) model (Auffusion), and uses frame-wise CLIP features together with textual-label guidance to improve both semantic and temporal alignment. By projecting high-resolution frame features and incorporating label-guided temporal conditioning, Smooth-Foley reports better quality and synchronization than existing models on VGGSound and on the more challenging VGGSound-Continuous subset, including more plausible Doppler-like dynamics. The results suggest that efficient, targeted adapters can improve continuous Foley generation for videos with moving sound sources, with practical implications for automated film post-processing.

Abstract

The video-to-audio (V2A) generation task has drawn attention in the multimedia field because of its practical value for producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate which sound events occur and when they occur. Recent work on synthesizing immersive, synchronized audio struggles with videos whose sounding objects move through the scene: the temporal condition is not accurate enough, and a low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both the semantic and the temporal alignment of the audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features, while a temporal adapter integrates temporal conditions obtained from similarities between visual frames and textual labels. Incorporating semantic guidance from textual labels yields more precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley outperforms existing models in both continuous-sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.
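
The abstract describes a temporal condition obtained from similarities between visual frames and textual labels. As a rough illustration only, and not the authors' implementation, the sketch below computes a per-frame cosine similarity between a sequence of frame embeddings and a single label embedding. The function name `frame_label_similarity` and the toy 3-dimensional vectors are assumptions made for this example; in practice the embeddings would presumably come from an image-text encoder such as CLIP, and how the resulting curve is fed to the temporal adapter is not specified here.

```python
import numpy as np


def frame_label_similarity(frame_emb, label_emb):
    """Cosine similarity between each frame embedding and one label embedding.

    frame_emb: array-like of shape (num_frames, dim)
    label_emb: array-like of shape (dim,)
    Returns an array of shape (num_frames,) with values in [-1, 1].
    Hypothetical helper for illustration; not part of any released code.
    """
    frames = np.asarray(frame_emb, dtype=np.float64)
    label = np.asarray(label_emb, dtype=np.float64)

    frame_norms = np.linalg.norm(frames, axis=1, keepdims=True)
    label_norm = np.linalg.norm(label)
    if np.any(frame_norms == 0) or label_norm == 0:
        raise ValueError("embeddings must be non-zero vectors")

    # Normalize to unit length, then take dot products.
    return (frames / frame_norms) @ (label / label_norm)


# Toy stand-ins for real embeddings (assumed values, not real features).
label = [1.0, 0.0, 0.0]
frames = [
    [1.0, 0.0, 0.0],  # frame aligned with the label
    [0.0, 1.0, 0.0],  # frame unrelated to the label
    [1.0, 1.0, 0.0],  # frame partially matching the label
]
sims = frame_label_similarity(frames, label)
print(sims)
```

A curve like `sims` varies smoothly as an object gradually leaves or shrinks in the frame, which is one plausible reason a similarity-based signal could suit continuous sounds better than a hard per-frame on/off decision.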

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Examples illustrating the deficits. With continuous sounds and ambiguous objects, FoleyCrafter (FC) fails to predict the presence of sound. In the first case, a train stays in the video for a long time, but FC predicts event probabilities of nearly zero for some frames. In the second case, when the flying airplane becomes tiny in the frame, FC fails to detect it, leading to unsatisfactory generation results.
  • Figure 2: Overall pipeline of Smooth-Foley. Note that the frame adapter and the temporal controller module are trained separately.
  • Figure 3: Label distribution of VGGSound-Continuous; most labels are sounding objects in motion.
  • Figure 4: Qualitative comparison of temporal alignment across models: FoleyCrafter (FC), Diff-Foley (DF), and Smooth-Foley (ours).