FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
TL;DR
FoleyGRAM tackles video-to-audio (V2A) generation by using the Gramian Representation Alignment Measure (GRAM) to align video, text, and audio embeddings in a shared semantic space, and by conditioning a diffusion-based audio synthesizer on these GRAM-aligned embeddings together with waveform envelopes. The GRAM-aligned embeddings provide semantic control, while ControlNet-inspired temporal conditioning on the envelope keeps the generated audio synchronized with video events. On the Greatest Hits dataset, FoleyGRAM achieves better semantic fidelity and temporal alignment than state-of-the-art baselines, and ablations confirm the benefit of combined audio-video-text (AVT) conditioning. The approach is practical for sound design: it produces high-quality 44.1 kHz stereo audio while training efficiently on a small amount of data. By combining GRAM-based conditioning with diffusion-based synthesis, FoleyGRAM enables flexible, semantically rich control over V2A generation.
Abstract
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
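
As an illustration of the alignment idea mentioned above, the sketch below computes a Gramian-style alignment score for one video, text, and audio embedding. It assumes, following the general GRAM formulation as we understand it, that alignment is measured by the volume of the parallelotope spanned by the unit-normalized modality vectors, i.e. the square root of the determinant of their Gram matrix, so that a smaller volume indicates closer alignment. The function names and the toy 4-dimensional vectors are hypothetical and are not taken from the FoleyGRAM implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gram_matrix(vectors):
    """Matrix of pairwise dot products between the given vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def gram_volume(video, text, audio):
    """Volume spanned by three unit-normalized embeddings (hypothetical helper).

    Smaller values mean the three modality vectors point in more similar
    directions; mutually orthogonal unit vectors give a volume of 1.
    """
    vectors = [normalize(video), normalize(text), normalize(audio)]
    det = det3(gram_matrix(vectors))
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(det, 0.0))

# Toy embeddings (assumed values, for illustration only).
aligned = gram_volume([1.0, 0.0, 0.0, 0.0],
                      [0.99, 0.1, 0.0, 0.0],
                      [0.98, 0.0, 0.1, 0.0])
orthogonal = gram_volume([1.0, 0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0])
print(f"nearly aligned: {aligned:.4f}, orthogonal: {orthogonal:.4f}")
```

In a training setting, a quantity of this kind would typically be minimized for matching video-text-audio triplets and kept large for mismatched ones; how FoleyGRAM's encoders use it in detail is not specified in this abstract.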
