TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang

TL;DR

This work addresses the mono-only output of most text-to-audio generation systems by introducing TTMBA, a cascaded pipeline that adds temporal and spatial control to multisource binaural audio generation. It combines GPT-4o-based extraction of structured sound events, TangoFlux-driven mono audio generation, and a Fourier-domain binaural renderer that predicts per-frame magnitude and phase adjustments conditioned on 3D source positions, followed by weighted overlap-add (WOLA) synthesis. The approach is presented as a first-of-its-kind text-conditioned binaural system with controllable duration, start time, and location for each sound event. The reported experiments show better performance than the compared baselines on both mono audio quality and spatial perception, at low computational cost. In practice, TTMBA targets immersive VR, AR, and audio-visual applications that need accurate source localization and flexible timing.
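To make the idea of a structured sound event more concrete, the following is a minimal sketch of what one such record might look like once the LLM output has been parsed. The JSON layout and every field name (`description`, `start_s`, `duration_s`, `position_xyz`) are assumptions made for illustration; the summary only states that each event carries a duration, a start time, and a 3D location.

```python
import json
from dataclasses import dataclass


@dataclass
class SoundEvent:
    # Hypothetical schema: the source only says each event has a
    # duration, a start time, and a 3D source position.
    description: str
    start_s: float
    duration_s: float
    position_xyz: tuple


def parse_events(llm_json: str) -> list:
    """Parse a JSON array (assumed LLM output format) into SoundEvent
    objects, sorted by start time."""
    raw = json.loads(llm_json)
    events = [
        SoundEvent(
            description=item["description"],
            start_s=float(item["start_s"]),
            duration_s=float(item["duration_s"]),
            position_xyz=tuple(float(v) for v in item["position_xyz"]),
        )
        for item in raw
    ]
    return sorted(events, key=lambda e: e.start_s)


# Illustrative input only; not taken from the paper.
example = """[
  {"description": "a dog barking", "start_s": 2.0, "duration_s": 3.0,
   "position_xyz": [1.0, 0.5, 0.0]},
  {"description": "a car passing", "start_s": 0.0, "duration_s": 5.0,
   "position_xyz": [-1.0, 2.0, 0.0]}
]"""
events = parse_events(example)
```

Sorting by start time here is a convenience for later stages, not something the source specifies.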

Abstract

Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
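The final step of the pipeline, arranging the rendered binaural clips by their start times, can be sketched as a simple additive mix into a shared two-channel buffer. This is an assumption about how the arrangement might be done, not the authors' implementation; the function name `mix_binaural`, the `(2, n_samples)` array layout, and the tiny sample rate in the demo are all hypothetical.

```python
import numpy as np


def mix_binaural(clips, sample_rate):
    """Sum binaural clips into one timeline.

    clips: list of (start_s, audio) pairs, where audio is an array of
    shape (2, n_samples) holding the left and right channels.
    """
    # Convert each start time in seconds to a sample offset.
    starts = [int(round(start_s * sample_rate)) for start_s, _ in clips]
    # The output must be long enough to hold the latest-ending clip.
    total = max(s + audio.shape[1] for s, (_, audio) in zip(starts, clips))
    mix = np.zeros((2, total), dtype=np.float64)
    for s, (_, audio) in zip(starts, clips):
        # Overlapping sources are simply added sample by sample.
        mix[:, s:s + audio.shape[1]] += audio
    return mix


# Toy demo with a 4 Hz "sample rate" so the arithmetic is easy to follow.
sr = 4
clip_a = np.ones((2, 8))        # 2 s clip starting at 0 s
clip_b = 0.5 * np.ones((2, 4))  # 1 s clip starting at 1.5 s
out = mix_binaural([(0.0, clip_a), (1.5, clip_b)], sr)
```

In the demo the two clips overlap for half a second, so the overlapping samples hold the sum of both sources. A real system would likely also need to guard against clipping after summation, which this sketch leaves out.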

Paper Structure

This paper contains 11 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Illustration of acoustic propagation models: (a) Conventional geometric acoustic simulation based solely on room impulse responses; (b) Binaural modeling incorporating listener-specific cues.
  • Figure 2: The framework of the proposed text-to-multisource binaural audio generation network.
  • Figure 3: Subjective evaluations of the generated binaural audio: (a) The comparison of MOS-P distribution and average scores across the four methods; (b) Percentage of correct answers in the direction perception test.