$^R$FLAV: Rolling Flow matching for infinite Audio Video generation
Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
TL;DR
This work tackles joint audio–video generation of unconstrained duration by introducing $^R$FLAV, a rolling rectified-flow transformer that processes video and audio in two separate branches connected by a lightweight cross-modality fusion strategy. The model uses a rolling diffusion scheme, with a pre-rolling phase followed by a rolling phase, to sustain quality over long sequences, and pairs it with Flow Matching to learn efficient trajectories between noise and data. Key contributions include a novel lightweight cross-modality interaction block, an AV encoding scheme that preserves frame-wise alignment, and extensive ablations and comparisons showing state-of-the-art performance on Landscape and AIST++ while enabling infinite AV generation. The approach achieves strong multimodal synchronization, coherent long-duration generation, and competitive audio quality, and its code is open source to support reproducibility and further research.
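The combination of a rolling window and Flow Matching described above can be illustrated with a minimal toy sketch. This is not the authors' implementation. It assumes the common rectified-flow convention (t = 0 is pure noise, t = 1 is clean data, straight-line path between them), a window in which each frame sits at a different time level, and an already pre-rolled starting window. The function names (`interpolate`, `velocity_target`, `rolling_times`, `rolling_generate`) are hypothetical, and the oracle velocity stands in for what a trained network would predict.

```python
import random


def interpolate(noise, data, t):
    """Straight-line path (assumed convention): t=0 is noise, t=1 is data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Constant velocity along the straight path; the regression target in flow matching."""
    return [d - n for n, d in zip(noise, data)]


def rolling_times(window):
    """Per-frame time levels: head frame is nearly clean, tail frame is pure noise."""
    return [1.0 - (k + 1) / window for k in range(window)]


def rolling_generate(data, window, seed=0):
    """Toy rolling phase; the oracle velocity replaces a trained network (assumption)."""
    rng = random.Random(seed)
    dim = len(data[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in data]
    times = rolling_times(window)
    # Assume pre-rolling is done: frame k of the window sits at level times[k].
    state = [interpolate(noise[k], data[k], times[k]) for k in range(window)]
    first = 0               # index of the frame at the head of the window
    step = 1.0 / window     # every frame advances by this much per rolling step
    emitted = []
    while first + window <= len(data):
        # One Euler step for every frame in the window.
        for k in range(window):
            v = velocity_target(noise[first + k], data[first + k])
            state[k] = [x + step * vi for x, vi in zip(state[k], v)]
        emitted.append(state.pop(0))  # head frame has reached t = 1 (clean)
        first += 1
        if first + window <= len(data):
            # A new frame enters the tail of the window as pure noise.
            state.append(list(noise[first + window - 1]))
    return emitted  # frames still inside the window at the end are not flushed
```

Because the window returns to the same set of time levels after every step, the loop can in principle continue indefinitely, which is the property that makes unbounded-length generation possible in a rolling scheme.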
Abstract
Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily because of three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and limitless video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all three of these challenges. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning the audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
