FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders

Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello

TL;DR

FoleyGRAM tackles video-to-audio (V2A) generation by enforcing cross-modal semantic alignment with GRAM across video, text, and audio, and by conditioning a diffusion-based audio synthesizer on GRAM embeddings plus waveform envelopes. The method unifies multimodal representations via GRAM losses and uses ControlNet-inspired temporal conditioning to synchronize the generated audio with video events. Empirical results on the Greatest Hits dataset show that FoleyGRAM achieves better semantic fidelity and temporal alignment than state-of-the-art baselines, with ablations confirming the benefit of conditioning on audio, video, and text together (AVT). The approach is practical for sound design: it delivers high-quality 44.1 kHz stereo audio with efficient training and small data requirements. By combining GRAM-based conditioning with diffusion-based synthesis, FoleyGRAM advances cross-modal generation and enables flexible, semantically rich V2A control for real-world applications.

Abstract

In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
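The abstract names waveform envelopes as the temporal conditioning signal but does not say here how they are computed. As a rough illustration only, the sketch below assumes a frame-wise RMS envelope; the function name `rms_envelope` and the `frame_length` and `hop_length` values are hypothetical and are not taken from the paper.

```python
import numpy as np


def rms_envelope(waveform, frame_length=512, hop_length=128):
    """Frame-wise RMS envelope of a waveform (an assumed envelope definition).

    `waveform` is either a 1-D array of samples or a 2-D array shaped
    (channels, samples); multi-channel input is averaged to mono first.
    """
    x = np.asarray(waveform, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=0)  # collapse stereo to mono before measuring energy
    if x.size < frame_length:
        # Zero-pad very short clips so that at least one full frame exists.
        x = np.pad(x, (0, frame_length - x.size))

    n_frames = 1 + (x.size - frame_length) // hop_length
    env = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = x[start:start + frame_length]
        env[i] = np.sqrt(np.mean(frame ** 2))  # root-mean-square of the frame
    return env


# Silence with a single burst: the envelope peaks where the burst sits.
signal = np.zeros(2048)
signal[1024:1536] = 1.0
envelope = rms_envelope(signal)
peak_frame = int(np.argmax(envelope))
```

An envelope of this kind encodes when sound events occur and how loud they are without carrying timbre, which is why it suits temporal conditioning while the embeddings carry the semantics.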

Paper Structure

This paper contains 23 sections, 10 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Example showing the ground-truth audio and video alongside the corresponding waveform generated by the proposed method.
  • Figure 2: GRAM framework, in which the representations learned by the three encoders (EVAClip-ViT-G for video, BERT-B for text, and BEATs for audio) form the edges of a high-dimensional parallelotope whose volume indicates how well the data are aligned.
  • Figure 3: FoleyGRAM architecture: relevant semantic features are extracted from reference video, audio, and text through GRAM-aligned multimodal encoders. These features condition an audio synthesis model that, together with temporal information supplied as an envelope signal fed to a ControlNet, generates audio that is temporally and semantically aligned with the reference video. At inference time, the three modalities can be used jointly or separately to generate the desired output. The conditioning samples may also differ entirely in semantic content from the video being sonorized, so sound designers can freely choose the samples that define the semantics of the generated audio.
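The parallelotope volume in the Figure 2 caption can be illustrated with a short sketch. For unit-norm vectors stacked as columns of a matrix A, the volume of the parallelotope they span is the square root of the determinant of the Gram matrix AᵀA. The function name `gram_volume` and the toy vectors are hypothetical, and this is a geometric illustration rather than the authors' training code.

```python
import numpy as np


def gram_volume(*embeddings):
    """Volume of the parallelotope spanned by L2-normalized embeddings.

    Each embedding is a 1-D vector of the same dimension, for example one
    per modality. The result is sqrt(det(A^T A)) with the normalized
    vectors as the columns of A.
    """
    columns = [np.asarray(e, dtype=np.float64) for e in embeddings]
    columns = [c / np.linalg.norm(c) for c in columns]  # unit-length edges
    A = np.stack(columns, axis=1)                       # shape (dim, k)
    gram = A.T @ A                                      # k x k Gram matrix
    det = np.linalg.det(gram)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(det, 0.0)))


# Toy vectors standing in for video, text, and audio embeddings.
video = np.array([1.0, 0.0, 0.0, 0.0])
text = np.array([0.0, 1.0, 0.0, 0.0])
audio = np.array([0.0, 0.0, 1.0, 0.0])

unrelated_volume = gram_volume(video, text, audio)  # mutually orthogonal
aligned_volume = gram_volume(video, video, video)   # identical directions
```

Mutually orthogonal unit vectors give the largest possible volume, while vectors pointing in the same direction collapse the parallelotope to zero volume, so a smaller volume corresponds to embeddings that agree more closely.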