LoVA: Long-form Video-to-Audio Generation

Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, Ruihua Song

TL;DR

LoVA addresses long-form video-to-audio (V2A) generation, where existing methods struggle to stay consistent across extended durations. It introduces a latent diffusion model whose denoiser is a Diffusion Transformer (DiT), conditioned on extended video features from CLIP and operating on a VAE-based audio latent, which enables parallel denoising over arbitrarily long sequences. The forward diffusion process is $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and the training loss is $\mathcal{L} = \|\hat{\epsilon}_t - \epsilon\|^2$, where $z_0$ is the clean audio latent and $\hat{\epsilon}_t$ is the noise predicted by the denoiser at step $t$. Compared to autoregressive and UNet-based baselines, LoVA achieves state-of-the-art performance on the long-form UnAV100 dataset and competitive results on the short-form VGGSound dataset, while supporting audio durations roughly six times longer than prior diffusion models. The work emphasizes a practical long-form V2A evaluation protocol and demonstrates that DiT-based diffusion better handles long-range dependencies and consistency. Future directions include temporal synchronization and controllability via text and duration control.
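The forward noising step and the noise-prediction loss quoted in the TL;DR can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `forward_diffuse` and `noise_prediction_loss`, the latent shape, and the value of $\bar{\alpha}_t$ are assumptions chosen for illustration, and the denoiser itself is left out.

```python
import numpy as np


def forward_diffuse(z0, alpha_bar_t, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps


def noise_prediction_loss(eps_hat, eps):
    """Squared L2 norm ||eps_hat - eps||^2 between predicted and true noise."""
    return float(np.sum((eps_hat - eps) ** 2))


rng = np.random.default_rng(0)
# Hypothetical audio latent: 4 latent time steps, 8 channels (sizes are assumptions).
z0 = rng.standard_normal((4, 8))
z_t, eps = forward_diffuse(z0, alpha_bar_t=0.5, rng=rng)

# A real denoiser would predict eps from z_t, t, and the video condition;
# a zero prediction stands in for it here, only to exercise the loss.
loss = noise_prediction_loss(np.zeros_like(eps), eps)
```

Because the noise is added elementwise, the same step applies unchanged to a latent of any length, which is what makes parallel denoising over long sequences possible in principle.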

Abstract

Video-to-audio (V2A) generation is important for video editing and post-processing, enabling the creation of semantically aligned audio for silent video. However, most existing methods focus on generating short-form audio for short video segments (less than 10 seconds), while giving little attention to long-form video inputs. For current UNet-based diffusion V2A models, an inevitable problem in long-form audio generation is inconsistency within the final concatenated audio. In this paper, we first highlight the importance of the long-form V2A problem. We then propose LoVA, a novel model for Long-form Video-to-Audio generation. Based on the Diffusion Transformer (DiT) architecture, LoVA proves more effective at generating long-form audio than existing autoregressive models and UNet-based diffusion models. Extensive objective and subjective experiments demonstrate that LoVA achieves comparable performance on a 10-second V2A benchmark and outperforms all other baselines on a benchmark with long-form video input.

Paper Structure (14 sections, 5 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Long-form V2A example. Current (8s/10s) UNet-based diffusion V2A models (DiffFoley, TiVA, FoleyCrafter) exhibit inconsistency when generating long-form (30s) audio, as indicated by clear mel-spectrogram boundaries and structural variances. In contrast, our LoVA produces consistent results similar to the ground truth.
  • Figure 2: (a) Comparison of three long-form V2A approaches. From top to bottom: autoregressive methods, UNet-based diffusion, and DiT-based diffusion (our LoVA), characterized respectively by inefficient one-by-one generation, inconsistent generation over fixed-length splits, and parallel processing of arbitrary-length sequences. (b) Overview of LoVA. Given a video of any length, it samples a latent noise sequence of the corresponding length, denoises it, and decodes the result into audio of the corresponding length.
  • Figure 3: Comparison of the long-form audio generation ability of the UNet and DiT structures. FAD and KL are used to measure the quality of the generated audio. The experiment is carried out on the UnAV100 test set. The splitting duration is the duration of each sub-video, and of the sub-audio generated from it, per inference.
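The notion of a splitting duration in the captions above can be made concrete with a small sketch. The helper `split_segments` is hypothetical and not part of LoVA. It only shows how a fixed-length model would cut a long clip into sub-videos, and how many concatenation boundaries result; the 30-second clip and the 8-second and 10-second windows mirror the durations mentioned for Figure 1.

```python
def split_segments(total_seconds, split_seconds):
    """Return (start, end) pairs covering [0, total_seconds] in fixed-length chunks.

    The last chunk is shorter when total_seconds is not a multiple of split_seconds.
    """
    if total_seconds <= 0 or split_seconds <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + split_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A 30-second clip handled by a model with a 10-second window: three inferences,
# hence two internal boundaries where the concatenated audio may be inconsistent.
ten_second_splits = split_segments(30, 10)
boundaries_10s = len(ten_second_splits) - 1

# The same clip with an 8-second window needs four inferences.
eight_second_splits = split_segments(30, 8)

# A model that accepts the whole clip at once has no internal boundary.
single_pass = split_segments(30, 30)
```

Under this view, a longer splitting duration means fewer boundaries, and a single pass over the full sequence removes them entirely.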