
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Qi Yang, Binjie Mao, Zili Wang, Xing Nie, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang

TL;DR

Draw an Audio presents a controllable video-to-audio synthesis framework built on a Latent Diffusion Model backbone, equipped with a Mask-Attention Module (MAM) to focus on video regions and a Time-Loudness Module (TLM) to shape temporal and loudness dynamics. By supporting multi-instruction inputs (text prompts, drawn video masks, and hand-drawn loudness signals), the approach improves content consistency, temporal alignment, and loudness realism. The model is trained on an extended VGGSound-Caption dataset and evaluated on AudioCaps and VGGSound-Caption, where it attains state-of-the-art performance and demonstrates multi-stage audio synthesis capabilities. Ablation studies confirm the effectiveness of MAM, TLM, and dual classifier-free guidance, highlighting the practical impact for automated Foley tasks and controllable sound design.

Abstract

Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining content consistency between the input video and the generated audio, as well as aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs a masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure the synthesis of sound that aligns with the video in both loudness and temporal dimensions. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.

Paper Structure (27 sections, 12 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Schematic illustrations of vanilla foley methods and our method. Traditional methods produce the entire audio in a single pass from video inputs alone, which limits controllability and flexibility. Draw an Audio, in contrast, offers a more appealing alternative that employs multiple instructions to produce high-quality synchronized audio and can produce mixed audio in multiple stages, thereby exhibiting greater practical applicability.
  • Figure 2: The architecture of Draw an Audio, which incorporates a Latent Diffusion Model (LDM) as the foundational model, a Text Condition Model for text instruction, a Mask-Attention Module (MAM) for video instruction, and a Time-Loudness Module (TLM) for signal instruction. Best viewed in color.
  • Figure 3: Visualization of generated samples. In fig. (a), while the other method fails to generate temporally synchronized audio based on the video, Draw an Audio successfully produces both quiet and loud sounds with temporal consistency. In fig. (b), our method generates sound with clear spikes matching the ground-truth mel-spectrogram.
  • Figure 4: Effect of the dual CFG scales. Figures (a)-(d) show different objective evaluation metrics. Holding $s_{text}=3.5$ fixed and gradually increasing $s_{video}$ from $1.5$ to $7.5$ yields a U-shaped curve. A similar pattern is observed when the procedure is reversed with $s_{video}=4.5$.
  • Figure I: Visualization of the loudness signal production process. (a) shows the original signal, and (b), (c), (d) show the signals after processing by RMS energy, APP, and EWMA, respectively.
  • ...and 6 more figures
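
Two of the loudness-signal steps named in the Figure I caption, RMS energy and EWMA, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the frame length, hop size, and smoothing factor `alpha` are arbitrary toy values, and the APP step is omitted because the caption does not define it.

```python
import math


def rms_energy(samples, frame_len=4, hop=4):
    """Root-mean-square energy of each frame of a mono signal.

    frame_len and hop are assumed toy values, not taken from the paper.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return frames


def ewma(values, alpha=0.5):
    """Exponentially weighted moving average.

    y[0] = x[0]; y[t] = alpha * x[t] + (1 - alpha) * y[t - 1].
    alpha is an assumed smoothing factor.
    """
    smoothed = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Toy signal: silence, a loud burst, then silence again.
signal = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.0] * 4
envelope = rms_energy(signal)   # per-frame loudness estimate
loudness = ewma(envelope)       # smoothed loudness curve
print(envelope)  # [0.0, 1.0, 0.0]
print(loudness)  # [0.0, 0.5, 0.25]
```

On this toy input, the RMS step turns the raw waveform into a coarse per-frame envelope, and the EWMA step smooths the abrupt drop after the burst into a gradual decay, which is the kind of curve a hand-drawn loudness signal would resemble.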