Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Juncheng Wang; Chao Xu; Cheng Yu; Lei Shang; Zhe Hu; Shujun Wang; Liefeng Bo

TL;DR

The paper tackles synchronized video-to-audio generation by introducing Mel Quantization-Continuum Decomposition (Mel-QCD), which decomposes mel-spectrograms into semantic, energy, and standard-deviation components. The semantic part is quantized via Semantic Vector Quantization (SVQ) to reduce prediction complexity, while energy and standard-deviation remain continuous, enabling accurate video-driven prediction through a V2X predictor. A textual inversion module refines global semantics, and a ControlNet-enabled diffusion model synthesizes audio conditioned on both the predicted Mel-QCD and textual guidance. Evaluated on VGGSound with AvSync15, Mel-QCD achieves state-of-the-art performance across multiple quality, synchronization, and semantic metrics, demonstrating a favorable balance between completeness and complexity in the controlling signal. The approach offers a scalable, controllable V2A pipeline with practical implications for video editing, accessibility, and AI-assisted content creation.

Abstract

Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals and apply quantization or continuity to each, so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our code and demos will be released at https://wjc2830.github.io/MelQCD/.
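The decompose-then-recompose idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that the energy and standard-deviation signals are the per-frame mean and standard deviation of the mel-spectrogram, that the semantic signal is the frame-normalized residual, and that quantization is a plain nearest-codeword lookup. The function names `decompose` and `recompose` and the `codebook` argument are hypothetical.

```python
import numpy as np

def decompose(mel, codebook):
    """Split a mel-spectrogram into (codes, energy, std).

    mel:      array of shape (n_mels, T)
    codebook: array of shape (K, n_mels), a hypothetical set of codewords
    Assumption: energy/std are per-frame statistics; the paper may differ.
    """
    energy = mel.mean(axis=0)                 # continuous signal, shape (T,)
    std = mel.std(axis=0) + 1e-8              # continuous signal, shape (T,)
    semantic = (mel - energy) / std           # normalized frames, (n_mels, T)
    # Squared distance from every frame to every codeword: shape (T, K).
    diffs = semantic.T[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)              # quantized signal, shape (T,)
    return codes, energy, std

def recompose(codes, energy, std, codebook):
    """Rebuild an approximate mel-spectrogram from the three signals."""
    semantic_q = codebook[codes].T            # (n_mels, T)
    return semantic_q * std + energy
```

In this sketch the discrete `codes` carry the spectral shape of each frame, while `energy` and `std` restore its level and spread, so the reconstruction error comes only from the quantization step.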
