STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

Yong Ren, Chenxing Li, Manjie Xu, Wei Liang, Yu Gu, Rilin Chen, Dong Yu

TL;DR

STA-V2A generates audio that matches a video by refining the video features in two ways: a local temporal feature learned through an onset prediction pretext task, and a global semantic feature obtained through attentive pooling. A latent diffusion model (LDM) initialized from a pre-trained text-to-audio (T2A) model then uses text and the refined video features as cross-modal guidance to produce high-quality, semantically consistent, and temporally aligned audio. The paper also introduces AA-Align (Audio-Audio Align), a metric for temporal alignment. The key contributions are the onset-driven local temporal feature, the attentive-pooling global semantic feature, the T2A-prior-enhanced cross-modal LDM with ControlNet, and the AA-Align metric; subjective and objective experiments show improvements over prior V2A methods. The intended practical impact is better-synchronized, more harmonious audio for generated videos.

Abstract

Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. Ablation experiments validate the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.
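
To make the onset prediction pretext task more concrete, the sketch below shows one plausible way to turn audio onset times into per-frame training targets and score predictions against them. It is an illustration only: the function names, the per-frame binary labelling, and the choice of binary cross-entropy are assumptions, not details taken from the paper.

```python
import math


def onset_frame_labels(onset_times_s, num_frames, fps):
    """Return one 0/1 label per video frame (hypothetical labelling scheme).

    A frame is labelled 1 if at least one audio onset falls inside the
    time span covered by that frame, and 0 otherwise.
    """
    labels = [0] * num_frames
    for t in onset_times_s:
        idx = int(math.floor(t * fps))  # index of the frame containing time t
        if 0 <= idx < num_frames:       # ignore onsets outside the clip
            labels[idx] = 1
    return labels


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted onset probabilities and labels.

    Assumed loss: the source only says "onset prediction loss".
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Example: a 2-second clip at 4 frames per second with two onsets.
labels = onset_frame_labels([0.30, 1.26], num_frames=8, fps=4.0)
uninformed_loss = binary_cross_entropy([0.5] * 8, labels)
```

A video encoder trained against targets of this kind has to attend to the frames where sound-producing events begin, which is the sort of local temporal cue the abstract describes.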

Paper Structure

This paper contains 16 sections, 8 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the STA-V2A framework. The local and global video feature refinement module extracts local temporal video features through an onset prediction loss and global semantic video features through an attentive pooling module. The pre-trained T2A model initializes the LDM; text and global video features serve as semantic conditions introduced via cross-attention, and local video features act as temporal conditions introduced through an adapter.
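
The global branch in Figure 1 condenses many per-frame features into a single semantic vector. A minimal sketch of generic attentive pooling is given below, assuming a single learnable query vector and scaled dot-product scoring; the paper's actual module may differ, and the names `attentive_pool`, `frames`, and `query` are hypothetical.

```python
import math


def attentive_pool(frames, query):
    """Pool T frame features of dimension D into one D-dimensional vector.

    frames: list of T feature vectors, each a list of D floats.
    query:  list of D floats standing in for a learnable query (assumption).
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    # Scaled dot-product score between the query and every frame.
    scores = [
        sum(f_j * q_j for f_j, q_j in zip(frame, query)) / math.sqrt(d)
        for frame in frames
    ]
    # Numerically stable softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of frame features.
    pooled = [
        sum(w * frame[j] for w, frame in zip(weights, frames))
        for j in range(d)
    ]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_weights = attentive_pool(frames, [0.0, 0.0])
focused_pooled, focused_weights = attentive_pool(frames, [10.0, 0.0])
```

With a zero query every frame receives the same weight, so the result is plain mean pooling; a non-zero query shifts weight toward frames that match it, which is how such a module can suppress redundant frames.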