Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract

We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach aims to comprehensively capture all sound events induced by a video through the incremental generation of missing sound events. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of existing sounds. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from adjacent segments of the same video, allowing training with standard single-reference audiovisual datasets that are easily accessible. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines.
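The abstract does not give the guidance formula, so the following is only a minimal sketch of how one negatively guided step could combine model outputs, assuming a classifier-free-guidance-style rule. The function name, the three-prediction setup, and the weights `w_video` and `w_neg` are hypothetical illustrations, not the authors' formulation.

```python
from typing import List, Sequence


def negatively_guided_prediction(
    uncond: Sequence[float],
    video_cond: Sequence[float],
    audio_cond: Sequence[float],
    w_video: float = 2.0,
    w_neg: float = 1.0,
) -> List[float]:
    """Combine three denoiser outputs for one sampling step (hypothetical rule).

    uncond     : prediction with no conditioning
    video_cond : prediction conditioned on the video
    audio_cond : prediction conditioned on the already-generated audio
                 (the output of the guidance model in this sketch)
    The result moves toward the video condition and away from the
    existing audio, which discourages duplicating earlier sounds.
    """
    if not (len(uncond) == len(video_cond) == len(audio_cond)):
        raise ValueError("all predictions must have the same length")
    return [
        u + w_video * (v - u) - w_neg * (a - u)
        for u, v, a in zip(uncond, video_cond, audio_cond)
    ]


# Toy two-dimensional example: the first coordinate is shared with the
# existing audio and is pushed down; the second is unaffected.
guided = negatively_guided_prediction(
    uncond=[0.0, 0.0],
    video_cond=[1.0, 2.0],
    audio_cond=[1.0, 0.0],
)
print(guided)  # [1.0, 4.0]
```

With `w_neg=0.0` the rule reduces to ordinary classifier-free guidance on the video condition, which is one way to see the negative term as an add-on to a standard V2A sampler.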

Paper Structure

This paper contains 10 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure A1: Overview of the dataset construction pipeline. Multiple audio captions were generated for each video using Qwen2.5-VL via a two-step process: free-form captioning followed by structured JSON formatting. The input prompt shown at the top is simplified; see Fig. A2 for the full version.
  • Figure A2: Full prompt for generating multiple possible audio captions.
  • Figure A3: Examples from the Multi-Caps VGGSound dataset. We added multiple captions to the test split of the VGGSound dataset using Qwen2.5-VL, as shown in Fig. A1.
  • Figure A4: Additional spectrogram visualization. Our proposed method effectively suppresses previously generated sounds in subsequent steps while maintaining high alignment with the text prompts. The best and worst CLAP A-A and T-A scores are highlighted in red and blue, respectively.