FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment

Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello

TL;DR

FolAI tackles the challenge of producing temporally precise and semantically controllable Foley-like sounds from video by decoupling timing and content. It introduces a two-stage pipeline that first predicts a smooth RMS envelope from video to capture timing, and then synthesizes the final $44.1\mathrm{kHz}$ stereo audio with a diffusion model guided by both the envelope and semantic embeddings via a ControlNet-augmented Diffusion Transformer built on Stable Audio. The approach enables explicit temporal control and user-driven semantic adjustment, and it introduces Walking The Maps, a footstep-focused dataset, while achieving state-of-the-art results on Greatest Hits and robust performance on Walking The Maps. The work demonstrates practical viability for professional Foley workflows, showing that combining an interpretable temporal scaffold with rich semantic conditioning yields high-fidelity, well-aligned audio.

Abstract

Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal scaffold for the audio. In the second stage, a diffusion-based generative model produces sound effects conditioned both on this temporal envelope and on high-level semantic embeddings, provided by the user, that define the desired auditory content (e.g., material or action type). This modular design enables precise control over both timing and timbre, streamlining repetitive tasks while preserving creative flexibility in professional Foley workflows. Results on diverse visual contexts, such as footstep generation and action-specific sonorization, demonstrate that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings. Supplementary materials are accessible on our dedicated demo page at https://ispamm.github.io/FolAI.
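As a rough illustration of the kind of temporal scaffold described above (the TL;DR names it a smooth RMS envelope), the following sketch computes a frame-wise RMS envelope from a mono signal and smooths it. This is a minimal example under stated assumptions, not the authors' implementation: the frame length, hop size, moving-average smoothing, and the function names `rms_envelope` and `smooth` are all hypothetical choices made here for illustration. In FolAI the envelope used at inference is predicted from video; computing it from audio, as below, would correspond only to building a ground-truth target.

```python
import math


def rms_envelope(samples, frame_len=512, hop=256):
    """Frame-wise root-mean-square of a mono signal (list of floats).

    frame_len and hop are illustrative values, not taken from the paper.
    """
    if not samples:
        return []
    env = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_len]
        env.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return env


def smooth(env, width=5):
    """Centered moving average; the window shrinks at the edges.

    A moving average is an assumed stand-in for the paper's smoothing step.
    """
    half = width // 2
    out = []
    for i in range(len(env)):
        window = env[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


# Synthetic "footsteps": two short 200 Hz bursts inside one second of silence.
sr = 8000
signal = [0.0] * sr
for onset in (2000, 6000):
    for n in range(400):
        signal[onset + n] = math.sin(2 * math.pi * 200 * n / sr)

envelope = smooth(rms_envelope(signal))
```

The resulting `envelope` is zero over silent stretches and rises around each burst, which is the property that makes such a signal usable as an interpretable timing control: it says when energy occurs without saying what the sound is.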

Paper Structure

This paper contains 30 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Example showing ground truth audio and video, predicted RMS envelope and generated audio.
  • Figure 2: The FolAI architecture consists of two distinct parts: the video model, which predicts an envelope representative of the audio directly from the input video, and the audio synthesis model, which performs the controlled generation of the final audio effect. Generation is controlled temporally by the predicted RMS envelope through a DiT ControlNet, and semantically by CLAP and CAVP embeddings. The length of the output waveform can be controlled with the $\mathrm{seconds\_start}$ and $\mathrm{seconds\_total}$ parameters.
  • Figure 3: Example of the ground truth envelope before and after the smoothing operation.
  • Figure 4: Block diagram of the proposed video model.
  • Figure 5: Block diagram of the audio synthesis model. The ControlNet blocks are trainable, while the Stable DiT blocks are frozen.
  • ...and 1 more figure