Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao

TL;DR

Frieren applies rectified flow matching to video-to-audio (V2A) generation: it learns a transport vector field that carries noise to spectrogram latents along straight paths, and it samples by solving an ODE. A non-autoregressive vector field estimator with cross-modal feature fusion yields high audio quality and strong video–audio synchrony, reflow and one-step distillation reduce sampling to a few steps or even one, and classifier-free guidance further improves conditioning. On VGGSound, Frieren achieves state-of-the-art temporal alignment and generation quality, with up to 9.3x faster single-step sampling and a 6.2% higher inception score than a strong diffusion-based baseline. Combining rectified flow principles with these engineering choices makes the approach suited to real-time or high-throughput audio synthesis in aligned multimedia applications.

Abstract

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and conducts sampling by solving an ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even only one. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
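
As a rough illustration of the ideas named in the abstract, the sketch below shows a straight-path interpolation between noise and data, the constant velocity that a vector field estimator would regress onto, a classifier-free-guidance combination of conditional and unconditional velocities, and Euler integration of the resulting ODE. It is a minimal toy under stated assumptions, not the authors' implementation: the function names, the convention that t=0 is noise and t=1 is data, the guidance formula, and the "oracle" field standing in for a trained network are all assumptions made for illustration.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target on a straight path: the constant velocity x1 - x0."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the straight-path target."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def guided_velocity(v_cond, v_uncond, scale):
    """One common classifier-free-guidance form (assumed here, not taken from the paper)."""
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy data: random arrays stand in for noise and spectrogram latents.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # "noise"
x1 = rng.standard_normal((4, 8))   # "data"

# Hypothetical oracle field: returns the exact straight-path velocity.
# A trained estimator would approximate this from (x, t, video features).
oracle = lambda x, t: target_velocity(x0, x1)

one_step = euler_sample(oracle, x0, num_steps=1)
many_step = euler_sample(oracle, x0, num_steps=25)
```

With a perfectly straight field, a single Euler step lands on the same endpoint as 25 steps, which is the intuition behind why straightening the paths (via reflow and distillation, per the abstract) makes few-step or one-step sampling viable. A learned field is only approximately straight, so in practice step count still trades off against quality.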

Paper Structure (36 sections, 9 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Illustration of the sampling process of our rectified-flow-based V2A architecture.
  • Figure 2: Illustration of the model architecture of Frieren at different levels.
  • Figure 3: One-step generation results of different models. (a): The content of the input video is a woman playing the violin. (b): Diff-Foley generates meaningless audio with one step. (c, d): LDM and Frieren without reflow generate highly noisy audio. (e, f): Reflow enables Frieren to generate meaningful audio in one step, and distillation further improves the one-step generation quality.
  • Figure 4: IS and FAD of the models with different steps.
  • Figure 5: Model performance of Frieren under different CFG scales.
  • ...and 2 more figures