Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo

TL;DR

The paper tackles synchronized video-to-audio generation by introducing Mel Quantization-Continuum Decomposition (Mel-QCD), which decomposes mel-spectrograms into semantic, energy, and standard-deviation components. The semantic part is quantized via Semantic Vector Quantization (SVQ) to reduce prediction complexity, while energy and standard-deviation remain continuous, enabling accurate video-driven prediction through a V2X predictor. A textual inversion module refines global semantics, and a ControlNet-enabled diffusion model synthesizes audio conditioned on both the predicted Mel-QCD and textual guidance. Evaluated on VGGSound with AvSync15, Mel-QCD achieves state-of-the-art performance across multiple quality, synchronization, and semantic metrics, demonstrating a favorable balance between completeness and complexity in the controlling signal. The approach offers a scalable, controllable V2A pipeline with practical implications for video editing, accessibility, and AI-assisted content creation.

Abstract

Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio diffusion models, this paper shows how to balance completeness and complexity in the representation of mel-spectrograms through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals and represent each as either quantized or continuous, so that they can be effectively predicted from video by a dedicated video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics covering quality, synchronization, and semantic consistency. Our code and demos will be released at https://wjc2830.github.io/MelQCD/.
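To make the decompose-then-recompose idea concrete, here is a minimal NumPy sketch. It rests on an assumption that this summary does not confirm: that the energy is the per-frame mean over mel bins, the standard deviation is the per-frame spread over mel bins, and the semantic component is the per-frame normalized mel-map (the Figure 3 caption mentions a "normalized mel"). The function names `decompose_mel` and `recompose_mel` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def decompose_mel(mel, eps=1e-8):
    """Split a (n_mels, T) mel-map into three per-frame signals.

    Assumed definitions (not confirmed by the paper summary):
      energy   E_t     = mean over mel bins of frame t
      std      D_t     = standard deviation over mel bins of frame t
      semantic S_{.,t} = (M_{.,t} - E_t) / (D_t + eps)
    """
    energy = mel.mean(axis=0)                  # shape (T,)
    std = mel.std(axis=0)                      # shape (T,)
    semantic = (mel - energy) / (std + eps)    # shape (n_mels, T)
    return semantic, energy, std

def recompose_mel(semantic, energy, std, eps=1e-8):
    """Invert decompose_mel: rescale and shift each frame back."""
    return semantic * (std + eps) + energy

# Toy input standing in for a real log-mel spectrogram.
rng = np.random.default_rng(0)
mel = rng.normal(size=(64, 100))

semantic, energy, std = decompose_mel(mel)
recon = recompose_mel(semantic, energy, std)
```

Under these assumed definitions the decomposition is lossless before any quantization, so whatever information is lost in a full pipeline would come from quantizing the semantic part and from prediction error, not from the split itself.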

Paper Structure

This paper contains 14 sections, 1 theorem, 6 equations, 5 figures, 7 tables.

Key Result

Proposition 1

Given an audio clip represented as a mel-map $\mathbf{M}$, the components $\mathbf{S}_{\cdot,t}$ tend to be distinguishable with respect to sound events, while the other two components, $\mathbf{E}_{t}$ and $\mathbf{D}_{t}$, are continuously distributed across different sound events.
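The proposition suggests why only the semantic component is quantized: if the frames $\mathbf{S}_{\cdot,t}$ cluster by sound event, each frame can be replaced by the index of a nearby codeword, while $\mathbf{E}_{t}$ and $\mathbf{D}_{t}$ stay as real values. The sketch below illustrates this with plain nearest-neighbor vector quantization, which is a stand-in and not necessarily how the paper's SVQ works. The codebook values and the function name `quantize_frames` are hypothetical.

```python
import numpy as np

def quantize_frames(semantic, codebook):
    """Assign each frame (column) of `semantic` to its nearest codeword.

    semantic: array of shape (n_mels, T)
    codebook: array of shape (K, n_mels), a hypothetical learned codebook
    returns:  integer array of shape (T,) with values in [0, K)
    """
    frames = semantic.T                                   # (T, n_mels)
    diffs = frames[:, None, :] - codebook[None, :, :]     # (T, K, n_mels)
    dists = (diffs ** 2).sum(axis=-1)                     # squared Euclidean, (T, K)
    return dists.argmin(axis=1)

# Hypothetical codebook with two codewords over three mel bins.
codebook = np.array([[1.0, -1.0, 0.0],
                     [-1.0, 1.0, 0.0]])

# Toy semantic component: 3 mel bins by 3 frames.
semantic = np.array([[0.9, -1.1, 1.2],
                     [-0.8, 0.9, -1.0],
                     [0.1, 0.0, -0.1]])

indices = quantize_frames(semantic, codebook)
print(indices.tolist())  # [0, 1, 0]
```

Predicting a discrete index per frame turns that part of the task into classification, while the energy and standard deviation remain regression targets.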

Figures (5)

  • Figure 1: Task formulation of this paper (a). Previous mainstream approaches focus on extracting control signals from videos to govern audio generation (b). However, they struggle to balance between ease of prediction and precision of control. In response, our proposed Mel-QCD achieves a more effective trade-off (c).
  • Figure 2: Pipeline for the proposed Mel-QCD controllable video-to-audio (V2A) generation. The process is divided into two parts: (a) Pre-training, which outlines how to derive Mel-QCD from videos; and (b) Training, which explains how to utilize Mel-QCD and textual inversion to train the video-controlled audio generation model.
  • Figure 3: Properties of each component of the mel-map containing two sound events: shooting and gunshot spreading. The energy reflects the continuum of the mel-map while the normalized mel reflects the semantic clustering property.
  • Figure 4: Case study on the VGGSound test set. The first row displays the video frames, followed by mel-spectrograms from the ground truth (GT), VTA-LDM, FoleyCrafter, and our method. As shown, our result is better synchronized than the others.
  • Figure 5: Visual comparison with different variants of our method.

Theorems & Definitions (1)

  • Proposition 1: Properties of Each Component