AudioSpa: Spatializing Sound Events with Text

Linfeng Feng, Lei Zhao, Boyu Zhu, Xiao-Lei Zhang, Xuelong Li

TL;DR

AudioSpa tackles text-guided binaural spatial audio generation using a monaural reference. The approach fuses text-derived conditioning via a fusion multi-head attention (FMHA) mechanism into a time-domain 1D convolutional backbone, producing binaural outputs that match text-specified directions. A dedicated binaural localization model provides objective spatial evaluation, while on-the-fly data augmentation creates diverse text–audio–location pairs from single-source data. Results show strong localization accuracy and perceptual quality in single-source scenarios, with ablations confirming the value of FMHA fusion and augmentation. This work advances text-to-spatial audio by integrating large language models, multimodal fusion, and dynamic data synthesis, offering a path toward more immersive, text-controlled auditory experiences.

Abstract

Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating the sense of spatiality, has not yet been explored. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is additionally provided. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention (FMHA) to integrate text tokens, which enhances the generation capability of multimodal learning. Additionally, we propose a binaural source localization model to assess the quality of the generated audio. Finally, we design a data augmentation strategy to generate diverse datasets, which enables the model to spatialize sound events across various spatial positions. Experimental results demonstrate that our model places sounds accurately at the specified locations. It achieves competitive performance in both localization accuracy and signal distortion. Our demonstrations are available at https://linfeng-feng.github.io/AudioSpa-demo.
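
To make the idea of attention-based fusion concrete, the following is a minimal NumPy sketch of multi-head cross-attention between audio frames and text tokens. It is not the paper's FMHA implementation: the abstract does not specify which modality supplies the queries, so the sketch assumes audio frames act as queries and text tokens as keys and values, with a residual connection. The function name `fusion_attention`, the projection matrices, and all dimensions are hypothetical choices for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fusion_attention(audio, text, w_q, w_k, w_v, w_o, num_heads):
    """Hypothetical cross-attention fusion (an assumption, not the paper's FMHA).

    audio: (T, d) audio frame features, used as queries.
    text:  (N, d) text token embeddings, used as keys and values.
    Returns the fused features (T, d) and attention weights (H, T, N).
    """
    T, d = audio.shape
    N = text.shape[0]
    head_dim = d // num_heads

    def split_heads(x, length):
        # (length, d) -> (num_heads, length, head_dim)
        return x.reshape(length, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(audio @ w_q, T)
    k = split_heads(text @ w_k, N)
    v = split_heads(text @ w_v, N)

    # Scaled dot-product attention: each audio frame attends over text tokens.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, T, head_dim)

    # Merge heads and add the text-derived context back onto the audio features.
    context = context.transpose(1, 0, 2).reshape(T, d)
    return audio + context @ w_o, weights


# Toy usage with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
d, heads = 16, 4
audio = rng.standard_normal((50, d))  # 50 audio frames
text = rng.standard_normal((7, d))    # 7 text tokens
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
fused, weights = fusion_attention(audio, text, w_q, w_k, w_v, w_o, heads)
```

The fused output keeps the audio sequence length, so under these assumptions it could be passed on to a convolutional backbone unchanged, while each attention row is a distribution over the text tokens.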

Paper Structure

This paper contains 33 sections, 12 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The model architecture of AudioSpa, which takes input text and monaural audio and outputs binaural audio in an end-to-end manner. For simplicity, we omit the activation functions.
  • Figure 2: The architecture of the binaural localization model. For simplicity, we omit the activation functions.
  • Figure 3: Ablation study of the data augmentation on the validation dataset.
  • Figure 4: A case study of spatializing specific sound events. Each spectrogram's x-axis represents time, and the y-axis represents frequency. The baseline is DSP, the SNR is 10 dB, and the audio duration is 4 seconds. In the binaural audio, the target sound is positioned directly to the left.