TTMBA: Towards Text To Multiple Sources Binaural Audio Generation
Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang
TL;DR
Most text-to-audio generation systems produce mono output. This work introduces TTMBA, a cascaded pipeline that generates multisource binaural audio from text with temporal and spatial control. It combines three stages: GPT-4o-based extraction of structured sound events, TangoFlux-driven mono audio generation, and a Fourier-domain binaural renderer that predicts per-frame magnitude and phase adjustments conditioned on 3D source positions, followed by weighted overlap-add (WOLA) synthesis. The resulting text-conditioned binaural system lets the duration, start time, and location of each sound event be controlled. The authors report better results than the compared baselines on both mono audio quality and spatial perception, at low computational cost. Intended applications include immersive VR, AR, and audio-visual experiences that need accurate source localization and flexible timing.
Abstract
Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting the spatial information that is essential for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates a mono audio clip for each event, with durations that vary across events. These mono clips are then transformed into binaural audio by a binaural rendering neural network conditioned on the spatial data from the LLM. Finally, the binaural clips are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
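The final step of the pipeline, arranging the rendered binaural clips by their start times, can be illustrated with a minimal sketch. It is not the authors' implementation. The names `BinauralEvent` and `mix_events` are hypothetical, and the sketch assumes that overlapping events are combined by simple sample-wise summation and that start times are rounded to the nearest sample, neither of which is stated in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BinauralEvent:
    """One rendered sound event (hypothetical container, not from the paper)."""
    start_time: float   # start time in seconds, as extracted from the text
    left: List[float]   # left-ear samples
    right: List[float]  # right-ear samples


def mix_events(events: List[BinauralEvent],
               sample_rate: int) -> Tuple[List[float], List[float]]:
    """Place each binaural clip at its start time and sum overlapping samples.

    Assumption: overlapping events are mixed by plain addition, with no
    normalization or clipping protection.
    """
    # First pass: validate inputs and find the length of the final mixture.
    total = 0
    offsets = []
    for ev in events:
        if ev.start_time < 0:
            raise ValueError("start_time must be non-negative")
        if len(ev.left) != len(ev.right):
            raise ValueError("left and right channels must have equal length")
        offset = round(ev.start_time * sample_rate)
        offsets.append(offset)
        total = max(total, offset + len(ev.left))

    # Second pass: accumulate every clip into the two output channels.
    left = [0.0] * total
    right = [0.0] * total
    for ev, offset in zip(events, offsets):
        for i, (l, r) in enumerate(zip(ev.left, ev.right)):
            left[offset + i] += l
            right[offset + i] += r
    return left, right


if __name__ == "__main__":
    # Toy example at 4 Hz so the sample positions are easy to follow.
    a = BinauralEvent(start_time=0.0, left=[1.0, 1.0], right=[0.5, 0.5])
    b = BinauralEvent(start_time=0.25, left=[1.0, 1.0, 1.0], right=[1.0, 1.0, 1.0])
    mixed_left, mixed_right = mix_events([a, b], sample_rate=4)
    print(mixed_left)   # [1.0, 2.0, 1.0, 1.0]
    print(mixed_right)  # [0.5, 1.5, 1.0, 1.0]
```

Plain Python lists keep the sketch dependency-free. A practical version would more likely use array operations and guard against clipping when many sources overlap.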
