
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian

Abstract

In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech audio and lack the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, together with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
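To make the TriAttn-DiT description concrete, the PyTorch sketch below shows one plausible block layout: self-attention over noisy audio latents, three parallel cross-attentions (on-screen environmental, off-screen environmental, and speech conditions), and a learned softmax gate that blends the three branches. All module names, dimensions, and the pooled gating head are illustrative assumptions; the paper's actual block may differ in normalization, conditioning, and gating details.

```python
# A minimal, hypothetical sketch of a TriAttn-DiT-style block. Names,
# shapes, and the softmax gate are assumptions, not the paper's code.
import torch
import torch.nn as nn


class TriAttnDiTBlock(nn.Module):
    """One block with three condition-specific cross-attentions whose
    outputs are blended by a learned MoE-style gate (illustrative)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One cross-attention per condition stream.
        self.cross_on = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_off = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 3)  # pooled latents -> three branch weights
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, c_on, c_off, c_speech):
        # x: (B, T, D) noisy audio latents; c_*: (B, L, D) condition embeddings.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # Three parallel cross-attention reads, one per condition.
        a_on = self.cross_on(h, c_on, c_on, need_weights=False)[0]
        a_off = self.cross_off(h, c_off, c_off, need_weights=False)[0]
        a_sp = self.cross_speech(h, c_speech, c_speech, need_weights=False)[0]
        # MoE gating: per-example softmax weights over the three branches.
        w = torch.softmax(self.gate(h.mean(dim=1)), dim=-1)  # (B, 3)
        x = x + (w[:, 0, None, None] * a_on
                 + w[:, 1, None, None] * a_off
                 + w[:, 2, None, None] * a_sp)
        return x + self.ffn(self.norm3(x))


# Toy forward pass: 2 clips, 250 latent frames, 512-dim features.
block = TriAttnDiTBlock(dim=512)
y = block(torch.randn(2, 250, 512), torch.randn(2, 32, 512),
          torch.randn(2, 32, 512), torch.randn(2, 64, 512))
```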

Figures (5)

  • Figure 1: Illustration of the proposed Universal Holistic Audio Generation (UniHAGen) task. The example depicts the scenario of on-screen speech with off-screen environmental sound. Our model is able to generate audio that is temporally and semantically aligned with the video, consistent with the environmental captions, and faithful to the given speech transcription.
  • Figure 2: (A) Overview of our proposed OmniSonic, which mainly consists of an environmental text encoder (FLAN-T5), a speech transcription encoder (SpeechT5), a visual encoder (CLIP visual encoder), an audio VAE, and our specially designed TriAttn-DiT blocks. The input example demonstrates the scenario of on-screen speech with off-screen environmental sound. The input conditions include visual frames, a speech transcription, an on-screen environmental sound caption (represented by an empty placeholder in this scenario), and an off-screen environmental sound caption. (B) Details of our proposed TriAttn-DiT block.
  • Figure 3: Visualization of the spectrograms of the generated audio and the ground truth.
  • Figure 4: Ablation study on the MoE Gating module on in-the-wild samples. In this example, sounds are generated with different gating configurations (see the illustrative sketch after this list). Top row: MoE Gating (full model); middle row: reduced weight for the speech branch; bottom row: reduced weight for the off-screen environmental branch.
  • Figure 5: Interface for the subjective evaluation.
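For intuition about the Figure 4 ablation, the hypothetical helper below rescales one branch of the gate's softmax output at inference and renormalizes. The branch ordering (on-screen, off-screen, speech) and the helper itself are assumptions for illustration, not the paper's released code.

```python
# Hypothetical emulation of the Figure 4 gating ablation.
import torch


def rescale_gate(weights: torch.Tensor, branch: int, scale: float) -> torch.Tensor:
    """weights: (B, 3) gate outputs over (on-screen, off-screen, speech)
    branches (an assumed ordering). Attenuates (scale < 1) or amplifies
    (scale > 1) one branch, then renormalizes the weights to sum to 1."""
    w = weights.clone()
    w[:, branch] *= scale
    return w / w.sum(dim=-1, keepdim=True)


# E.g., halve the speech branch, as in the middle row of Figure 4:
w = torch.softmax(torch.randn(2, 3), dim=-1)
print(rescale_gate(w, branch=2, scale=0.5))
```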