Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, Yiran Zhong

TL;DR

The paper addresses three limitations of existing video-to-audio (V2A) models: they lack fine-grained loudness control, offer little multi-modal conditioning, and rarely produce stereo or long-duration output. It introduces Tri-Ergon, a diffusion-based V2A model that conditions on textual, auditory, and visual prompts together with a LUFS embedding that controls loudness over time, and that generates stereo 44.1 kHz audio up to 60 seconds long. Training is supported by a new MM-V2A dataset with high-fidelity, long-duration audio and open-vocabulary multi-modal labels. The authors report better qualitative and quantitative results than existing V2A methods and see potential for professional Foley workflows and for broader media applications such as film, gaming, and VR, where detailed, semantically aligned audio must be synthesized from video.

Abstract

Video-to-audio (V2A) generation uses visual-only video features to produce realistic sounds that correspond to the scene. However, current V2A models often lack fine-grained control over the generated audio, especially in terms of loudness variation and the incorporation of multi-modal conditions. To overcome these limitations, we introduce Tri-Ergon, a diffusion-based V2A model that incorporates textual, auditory, and pixel-level visual prompts to enable detailed and semantically rich audio synthesis. Additionally, we introduce a Loudness Units relative to Full Scale (LUFS) embedding, which allows precise manual control of loudness changes over time for individual audio channels, enabling our model to address the intricate correlation between video and audio in real-world Foley workflows. Tri-Ergon can generate 44.1 kHz high-fidelity stereo audio clips of varying lengths up to 60 seconds, and it significantly outperforms existing state-of-the-art V2A methods, which typically generate mono audio of a fixed duration.
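
To make the idea of a per-channel loudness curve over time concrete, the following is a minimal sketch of how such a curve could be measured from a stereo clip. It is not the paper's LUFS embedding or its implementation. The function names, the one-second window, and the -70 dB floor are assumptions chosen for illustration, and the measurement is simplified: it applies the ITU-R BS.1770 formula to the raw mean square and omits the K-weighting filter and gating, so the values only approximate true LUFS.

```python
import math


def windowed_loudness_db(samples, sample_rate, window_s=1.0, floor_db=-70.0):
    """Approximate loudness per non-overlapping window for one channel.

    Simplified sketch: uses -0.691 + 10*log10(mean square) as in BS.1770,
    but skips K-weighting and gating, so results are not true LUFS.
    `samples` is a sequence of floats in [-1.0, 1.0].
    """
    window = max(1, int(round(window_s * sample_rate)))
    curve = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        mean_square = sum(x * x for x in chunk) / len(chunk)
        if mean_square <= 0.0:
            # Digital silence: clamp to the floor instead of taking log(0).
            curve.append(floor_db)
        else:
            level = -0.691 + 10.0 * math.log10(mean_square)
            curve.append(max(floor_db, level))
    return curve


def stereo_loudness_curves(left, right, sample_rate, window_s=1.0):
    """Return one loudness curve per channel (hypothetical helper)."""
    return {
        "left": windowed_loudness_db(left, sample_rate, window_s),
        "right": windowed_loudness_db(right, sample_rate, window_s),
    }


def sine(amplitude, freq_hz, sample_rate, seconds):
    """Generate a test tone as a list of floats."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


if __name__ == "__main__":
    sr = 8000
    # Left channel: loud tone then silence. Right channel: quieter tone throughout.
    left = sine(1.0, 1000.0, sr, 1.0) + [0.0] * sr
    right = sine(0.5, 1000.0, sr, 2.0)
    for name, curve in stereo_loudness_curves(left, right, sr).items():
        print(name, [round(v, 2) for v in curve])
```

A curve of this kind, one value per time window and per channel, is the sort of signal that a loudness-conditioned model could take as a control input. How Tri-Ergon actually encodes and injects it is described in the paper, not here.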

Paper Structure (3 sections)

This paper contains 3 sections.