$^R$FLAV: Rolling Flow matching for infinite Audio Video generation

Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

TL;DR

This work tackles joint audio–video generation with unconstrained duration by introducing $^R$FLAV, a rolling rectified-flow transformer architecture that decouples video and audio processing into two branches and uses a lightweight cross-modality fusion strategy. The model uses a rolling diffusion scheme with pre-rolling and rolling phases to sustain high quality over long sequences, coupled with Flow Matching to learn efficient trajectories between noise and data. Key contributions include a novel lightweight cross-modality interaction block, an AV encoding scheme that preserves frame-wise alignment, and extensive ablations and comparisons showing state-of-the-art performance on Landscape and AIST++ while enabling infinite AV generation. The approach achieves strong multimodal synchronization, coherent long-duration generation, and competitive audio quality, with open-source code enabling reproducibility and further research.

Abstract

Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: (i) quality of the generated samples; (ii) seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and (iii) limitless video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.

Paper Structure

This paper contains 20 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An overview of our $^R$FLAV model architecture.
  • Figure 2: Temporal alignment between video frames and mel spectrogram segments. Each video frame corresponds to a fixed-size section (F/T) of the mel spectrogram, allowing for a 1:1 mapping.
  • Figure 3: a) Cross-modal interaction via self-attention, where $\ominus$ rotated by 90° and $\ominus$ denote concatenation and split, respectively. b) Lightweight cross-modality interaction mechanism with temporal average modulation. c) Our final proposed $^R$FLAV block, an enhanced lightweight mechanism incorporating timestep embedding $t$ and optional class conditioning embedding $c$.
  • Figure 4: a) Rolling phase: at each step, a new clean frame is produced (highlighted in red) and subsequently removed from the window. Then, a new noisy frame (highlighted in blue) is appended to the end of the window. b) Pre-rolling phase: the frames are gradually denoised starting from a full noise configuration. The pre-rolling phase goes on for $N$ steps, until the window is ready for the rolling phase.
  • Figure 5: AV metrics and feature drift calculated on long (i.e., 240 frames) generated videos using a sliding window of 16 frames.
  • ...and 2 more figures
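
The captions of Figures 2 and 4 describe two mechanisms that a small sketch can make concrete: the 1:1 mapping between video frames and mel spectrogram segments, and the pre-rolling and rolling schedules over a window of frames. The Python snippet below is only an illustration under stated assumptions, not the authors' implementation. It tracks per-frame noise levels only (no model is involved), assumes a linear schedule in which 1.0 means pure noise and 0.0 means clean, and assumes the spectrogram length is an exact multiple of the number of frames. The function names `mel_slices`, `pre_roll`, and `roll_step` are hypothetical, and the exact number of pre-rolling steps in the paper's schedule may differ from this simplified one.

```python
import numpy as np


def mel_slices(num_frames: int, mel_len: int) -> list[tuple[int, int]]:
    """Map each video frame to a fixed-size (start, end) segment of the mel axis.

    Assumption: mel_len plays the role of F and num_frames the role of T in
    the Figure 2 caption, so each segment has length F / T.
    """
    if mel_len % num_frames != 0:
        raise ValueError("mel_len must be a multiple of num_frames")
    seg = mel_len // num_frames
    return [(i * seg, (i + 1) * seg) for i in range(num_frames)]


def pre_roll(window: int) -> np.ndarray:
    """Go from a fully noisy window to staggered noise levels.

    Starts at level 1.0 for every frame and denoises earlier frames more,
    ending at levels (i + 1) / window for frame i.
    """
    levels = np.ones(window)
    for k in range(1, window):
        # At partial step k, only the first (window - k) frames are denoised.
        levels[: window - k] -= 1.0 / window
    return levels


def roll_step(levels: np.ndarray) -> np.ndarray:
    """One rolling step: denoise all frames, emit the clean one, append noise."""
    window = len(levels)
    levels = levels - 1.0 / window      # every frame gets one step cleaner
    # levels[0] is now 0.0 (clean): it leaves the window as an output frame.
    return np.append(levels[1:], 1.0)   # a new pure-noise frame enters at the end


if __name__ == "__main__":
    print(mel_slices(4, 8))
    state = pre_roll(4)
    print(state)
    print(roll_step(state))
```

In this sketch the window returns to the same staggered configuration after every rolling step, which is what allows generation to continue for an arbitrary number of frames.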