FolAI: Synchronized Foley Sound Generation with Semantic and Temporal Alignment
Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello
TL;DR
FolAI tackles the challenge of producing temporally precise and semantically controllable Foley-like sounds from video by decoupling timing from content. Its two-stage pipeline first predicts a smooth RMS envelope from the video to capture timing, then synthesizes the final $44.1\,\mathrm{kHz}$ stereo audio with a ControlNet-augmented Diffusion Transformer built on Stable Audio, guided by both the envelope and semantic embeddings. The approach enables explicit temporal control and user-driven semantic adjustment. The work also introduces Walking The Maps, a footstep-focused dataset, and achieves state-of-the-art results on Greatest Hits and robust performance on Walking The Maps. It demonstrates practical viability for professional Foley workflows, showing that combining an interpretable temporal scaffold with rich semantic conditioning yields high-fidelity, well-aligned audio.
Abstract
Traditional sound design workflows rely on manual alignment of audio events to visual cues, as in Foley sound design, where everyday actions like footsteps or object interactions are recreated to match the on-screen motion. This process is time-consuming, difficult to scale, and lacks automation tools that preserve creative intent. Despite recent advances in vision-to-audio generation, producing temporally coherent and semantically controllable sound effects from video remains a major challenge. To address these limitations, we introduce FolAI, a two-stage generative framework that decouples the when and the what of sound synthesis, i.e., the temporal structure extraction and the semantically guided generation, respectively. In the first stage, we estimate a smooth control signal from the video that captures the motion intensity and rhythmic structure over time, serving as a temporal scaffold for the audio. In the second stage, a diffusion-based generative model produces sound effects conditioned both on this temporal envelope and on high-level semantic embeddings, provided by the user, that define the desired auditory content (e.g., material or action type). This modular design enables precise control over both timing and timbre, streamlining repetitive tasks while preserving creative flexibility in professional Foley workflows. Results on diverse visual contexts, such as footstep generation and action-specific sonorization, demonstrate that our model reliably produces audio that is temporally aligned with visual motion, semantically consistent with user intent, and perceptually realistic. These findings highlight the potential of FolAI as a controllable and modular solution for scalable, high-quality Foley sound synthesis in professional and interactive settings. Supplementary materials are accessible on our dedicated demo page at https://ispamm.github.io/FolAI.
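To make the notion of a smooth temporal control signal concrete, the following is a minimal sketch of how a frame-wise RMS envelope can be computed from a mono waveform and then smoothed. It is illustrative only and is not the authors' implementation: the function names `rms_envelope` and `smooth`, the frame length, the hop length, and the moving-average width are all assumptions that do not come from the text above. In FolAI the envelope is predicted from video at inference time; a computation of this kind would typically supply the target envelope from reference audio.

```python
import numpy as np


def rms_envelope(audio, frame_length=1024, hop_length=512):
    """Frame-wise RMS of a mono waveform given as a 1-D array.

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    audio = np.asarray(audio, dtype=np.float64)
    # Zero-pad very short inputs so that at least one full frame exists.
    if audio.size < frame_length:
        audio = np.pad(audio, (0, frame_length - audio.size))
    n_frames = 1 + (audio.size - frame_length) // hop_length
    # Index matrix of shape (n_frames, frame_length): one row per frame.
    starts = hop_length * np.arange(n_frames)[:, None]
    offsets = np.arange(frame_length)[None, :]
    frames = audio[starts + offsets]
    return np.sqrt(np.mean(frames ** 2, axis=1))


def smooth(envelope, width=5):
    """Moving-average smoothing (hypothetical choice of smoother).

    mode="same" keeps the output length equal to the input length, but the
    first and last few values are attenuated by the implicit zero padding.
    """
    kernel = np.ones(width) / width
    return np.convolve(envelope, kernel, mode="same")


if __name__ == "__main__":
    sr = 44100
    t = np.arange(sr) / sr
    # One second of a 441 Hz tone that is silent in its second half.
    tone = np.sin(2 * np.pi * 441 * t) * (t < 0.5)
    env = smooth(rms_envelope(tone))
    print(env.shape, float(env.max()))
```

The resulting envelope is a low-rate, one-dimensional curve that is high where the sound is energetic and near zero during silence, which is what makes it usable as an interpretable timing scaffold separate from the semantic content.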
