TA-V2A: Textually Assisted Video-to-Audio Generation

Yuhuan You, Xihong Wu, Tianshu Qu

TL;DR

TA-V2A introduces text-assisted video-to-audio generation by integrating video, audio, and text modalities and applying large language model–driven text guidance to improve semantic representation. It uses a contrastive video-audio-language pretraining (CVALP) module, feature mixing, and a latent diffusion model with guided inference to generate temporally aligned audio from video. The approach demonstrates improved semantic fidelity and temporal synchronization on VGGSound, outperforming baselines in objective metrics (FID, FAD, MKL, Align) and achieving higher subjective MOS, especially when user-provided prompts are used. The work suggests that text-guided multimodal fusion and latent-space diffusion offer practical benefits for interactive, human-centered sound generation in multimedia contexts.

Abstract

As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.
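
The abstract describes aligning language, audio, and video features in a shared latent space, and the TL;DR names the mechanism as contrastive pretraining (CVALP). The sketch below illustrates the general idea with a symmetric InfoNCE loss, a common contrastive objective. It is not the paper's implementation: the function names, the temperature value, and the choice to sum a video-audio term and a text-audio term are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_info_nce(batch_a, batch_b, temperature=0.07):
    """Symmetric InfoNCE: row i of batch_a and row i of batch_b form the positive pair.

    The temperature of 0.07 is a commonly used default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in batch_a]
    b = [l2_normalize(v) for v in batch_b]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between a[i] and b[j]
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(matrix):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(matrix):
            row_max = max(row)  # subtract the max for numerical stability
            log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
            total += log_z - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def three_modality_loss(video, text, audio):
    """Hypothetical combination: pull both video and text features toward audio."""
    return symmetric_info_nce(video, audio) + symmetric_info_nce(text, audio)


# Toy 2-D "embeddings": matching rows point in nearly the same direction.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]
text = [[0.8, 0.2], [0.2, 0.8]]

aligned_loss = three_modality_loss(video, text, audio)
shuffled_loss = three_modality_loss(video[::-1], text[::-1], audio)
```

With these toy vectors, the correctly paired batch yields a much smaller loss than the shuffled batch, which is the behavior a contrastive objective is meant to reward.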

Paper Structure

This paper contains 13 sections, 14 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The complete workflow of the TA-V2A generation system. The system takes a video and a textual description as input; the description is generated by an LLM. The CVALP module extracts and aligns features from video, audio, and text, producing audio-aligned video and text features. These features are fed into a latent diffusion model (LDM), which iteratively denoises random noise into an audio representation. During inference, guidance techniques such as classifier-free guidance (CFG) and human-modified text prompts steer generation toward better alignment between the generated audio and the input modalities. Finally, the audio representation is decoded into a Mel-spectrogram and synthesized into an audio waveform by a vocoder.
  • Figure 2: An Example of Video-Audio Alignment. The top shows frames from a badminton sequence, while the bottom compares audio spectrograms from different methods: Ground Truth, TA-V2A, Diff-Foley, and VTA-LDM. Yellow boxes highlight key synchronized moments between video and audio.
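
The Figure 1 caption mentions CFG as one of the guidance techniques used at inference. As a rough illustration of how classifier-free guidance is usually applied in diffusion models, the sketch below blends a conditional and an unconditional noise prediction at a single denoising step. The function name, the toy values, and the scale convention (scale 1 reproduces the conditional prediction) are assumptions; the paper's exact formulation may differ.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance for one denoising step.

    eps_uncond: noise predicted with the conditioning dropped (hypothetical values here)
    eps_cond:   noise predicted with video/text conditioning
    guidance_scale: 0 ignores the condition, 1 uses it as-is, >1 amplifies it
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" standing in for a latent tensor.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)      # [0.0, 1.0]
plain_condition = cfg_combine(eps_uncond, eps_cond, 1.0)  # [1.0, 3.0]
amplified = cfg_combine(eps_uncond, eps_cond, 2.0)        # [2.0, 5.0]
```

Scales above 1 extrapolate past the conditional prediction, which typically trades sample diversity for closer adherence to the conditioning signal.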