Temporally Aligned Audio for Video with Autoregression

Ilpo Viertola, Vladimir Iashin, Esa Rahtu

TL;DR

The paper addresses the problem of generating audio from video with precise temporal alignment and semantic relevance. It introduces V-AURA, the first autoregressive video-to-audio model that leverages a high-framerate visual encoder, cross-modal feature fusion, and a token-based neural audio codec to produce temporally aligned waveforms. To support training and evaluation, it introduces VisualSound, a filtered subset of VGGSound with strong audio-visual correspondence, and a synchronization-based metric for temporal alignment. Empirical results show V-AURA achieves superior temporal synchronization and semantic relevance across multiple datasets with comparable audio fidelity, validating the autoregressive approach and the curated benchmark.

Abstract

We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site

Paper Structure (14 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of V-AURA. Given stacks of RGB frames, the visual encoder extracts visual features, which are projected into visual feature embeddings. The temporal dimension of the visual embeddings is then aligned with that of the audio embeddings. The audio tokens from the previous generation step are embedded and summed to represent the full-band audio signal (Kumar et al., 2023). The tokenized audio sequence is padded with learned padding tokens ($P$). The embeddings of the two modalities are aligned and fused with cross-modal feature fusion before the next generation step in the Transformer. When the audio sequence reaches the desired length, it is decoded back to a waveform using the decoder of the pre-trained codebook-based autoencoder.
  • Figure 2: V-AURA generates temporally matching audio. Diff-Foley (Luo et al., 2023) misses some hits, whereas Frieren (Wang et al., 2024) generates too many. SpecVQGAN (Iashin and Rahtu, 2021) does not generate distinguishable hits.
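
The data flow described in the Figure 1 caption can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the toy dimensions, nearest-neighbour temporal resampling, and fusion by channel-wise concatenation are all assumptions made here for illustration, and a zero vector stands in for the learned padding token $P$.

```python
def embed_and_sum(tokens, tables):
    """Embed each codebook's token and sum the embeddings per time step.

    tokens: T steps, each a list of K codebook indices.
    tables: K embedding tables, each mapping an index to a D-dim vector.
    """
    dim = len(tables[0][0])
    out = []
    for step in tokens:
        vec = [0.0] * dim
        for k, idx in enumerate(step):
            vec = [a + b for a, b in zip(vec, tables[k][idx])]
        out.append(vec)
    return out


def pad_to(seq, length, pad_vec):
    """Right-pad a sequence of vectors to `length` steps with `pad_vec`."""
    return seq + [pad_vec] * (length - len(seq))


def align_time(visual, target_len):
    """Resample visual steps to `target_len` steps (nearest neighbour, an assumption)."""
    n = len(visual)
    return [visual[min(n - 1, t * n // target_len)] for t in range(target_len)]


def fuse(audio_emb, visual_emb):
    """Fuse per-step embeddings by channel-wise concatenation (an assumption)."""
    assert len(audio_emb) == len(visual_emb)
    return [a + v for a, v in zip(audio_emb, visual_emb)]  # list concatenation


# Hypothetical toy setup: 2 codebooks, 2 entries each, 3-dim audio embeddings.
tables = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # codebook 0
    [[0.5, 0.5, 0.5], [2.0, 2.0, 2.0]],  # codebook 1
]
audio_tokens = [[0, 1], [1, 0], [1, 1]]  # 3 steps generated so far
pad_vec = [0.0, 0.0, 0.0]                # stand-in for the learned token P
visual = [[0.1, 0.2], [0.3, 0.4]]        # 2 visual steps, 2-dim embeddings
target_len = 4                           # audio sequence length to fill

audio_emb = pad_to(embed_and_sum(audio_tokens, tables), target_len, pad_vec)
visual_emb = align_time(visual, target_len)
fused = fuse(audio_emb, visual_emb)      # 4 steps, 3 + 2 = 5 channels each
```

In this toy setup, `fused` would be the per-step input to the Transformer; the predicted tokens for the next step would be appended to `audio_tokens` and the process repeated until the desired length is reached.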