AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
Le Wang, Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, Di Zhang, Kun Gai
TL;DR
AudioGen-Omni tackles unified generation of diverse audio types conditioned on video and text inputs. It introduces a unified Multimodal Diffusion Transformer (MMDiT) with a lightweight lyrics-transcription encoder, AdaLN-based joint attention, and phase-aligned anisotropic positional infusion (PAAPI) for phase-consistent temporal alignment. The model unfreezes all modalities, masks missing inputs, and is trained with conditional flow matching, enabling robust cross-modal conditioning and precise lip-sync for generated audio, speech, and singing. Trained on large-scale video-text-audio corpora, it achieves state-of-the-art results on audio, speech, and song generation tasks and runs inference efficiently (about 1.91 s to generate 8 s of audio). This work lays the groundwork for broader multimodal generation, including potential extensions to video generation.
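To make the training objective mentioned above concrete, here is a minimal sketch of how one conditional flow matching training example could be built, together with one way to mask a missing modality. It is an illustration under stated assumptions, not the authors' implementation: the linear noise-to-data path, the function names cfm_training_pair and mask_missing, and the null_embedding placeholder are all hypothetical choices that this summary does not specify.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one conditional flow matching example for clean latents x1.

    Assumes a linear path from noise (t=0) to data (t=1); the paper's
    actual path and time sampling are not given in this summary.
    Returns (t, x_t, target_velocity).
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise sample
    t = rng.uniform()                    # time drawn from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    target = x1 - x0                     # constant velocity along that path
    return t, x_t, target

def mask_missing(cond, present, null_embedding):
    """Swap an absent modality's features for a placeholder embedding.

    null_embedding (hypothetical) has shape (channels,) and is broadcast
    over frames; in practice it might be learned or simply zeros.
    """
    if present:
        return cond
    return np.broadcast_to(null_embedding, cond.shape).copy()
```

A network trained this way would regress target from (x_t, t, conditions), so the same model can be conditioned on video only, text only, or both by masking whichever input is absent.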
Abstract
We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDiT) that generates high-fidelity audio, speech, and song coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that integrates large-scale video-text-audio corpora, yielding a model that generates semantically rich, acoustically diverse audio conditioned on multimodal inputs and adapts to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. These representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
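The idea of applying RoPE only to temporally structured modalities can be sketched as follows. This is an illustration rather than the paper's PAAPI: the function names rope and selective_rope, the choice of which streams count as temporal, and the use of timestamps in seconds so that streams with different frame rates share one clock are all assumptions made here for clarity.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding for x of shape (seq, dim), with dim even."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def selective_rope(tokens, fps, temporal):
    """Apply RoPE only to the streams named in `temporal`.

    tokens: dict name -> array (seq, dim); fps: dict name -> frames per second.
    Non-temporal streams (e.g. a global text caption) pass through unchanged.
    """
    out = {}
    for name, x in tokens.items():
        if name in temporal:
            # Assumption: index positions by time in seconds.
            positions = np.arange(x.shape[0]) / fps[name]
            out[name] = rope(x, positions)
        else:
            out[name] = x
    return out
```

Under these assumptions, a video frame and an audio frame that fall at the same timestamp receive the same rotation: frame 2 of an 8 fps video stream and frame 5 of a 20 fps audio stream both sit at 0.25 s. A caption with no temporal structure carries no positional rotation at all.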
