Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

TL;DR

This work tackles the challenge of generating realistic, spatially controllable stereo audio guided by multimodal context. It introduces SpatialSonic, a one-stage diffusion framework, and BEWO-1M, a large-scale simulation-driven dataset with multimodal captions and moving/multi-source scenes. SpatialSonic fuses text and region-aware image embeddings with an azimuth state matrix to provide coarse-to-fine spatial guidance, achieving superior objective and subjective metrics for text-to-audio and image-to-audio generation, while reducing error in interaural cues. The approach enables immersive spatial audio generation with potential impact on AR/VR and embodied AI, and lays groundwork for future expansion to 5.1-channel data and video-guided spatial audio.
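The interaural cues mentioned above are commonly summarized by two quantities: the interaural level difference (ILD, the energy ratio between the two channels) and the interaural time difference (ITD, the delay between them). The following is a minimal sketch of how these could be estimated from a stereo signal. It is not the paper's evaluation code: the function names, the sign conventions, and the cross-correlation estimator are assumptions made for illustration.

```python
import numpy as np

def interaural_level_difference(left, right, eps=1e-12):
    """ILD in dB; positive when the left channel carries more energy (assumed convention)."""
    e_left = np.sum(np.square(left))
    e_right = np.sum(np.square(right))
    return 10.0 * np.log10((e_left + eps) / (e_right + eps))

def interaural_time_difference(left, right, sample_rate):
    """ITD in seconds, taken from the peak of the full cross-correlation.

    Positive when the right channel lags the left (assumed convention).
    """
    corr = np.correlate(right, left, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(left) - 1).
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / sample_rate

# Hypothetical test signal: noise that reaches the right channel
# 8 samples later and at half the amplitude.
rng = np.random.default_rng(0)
left = rng.standard_normal(1024)
right = 0.5 * np.concatenate([np.zeros(8), left[:-8]])

print(interaural_level_difference(left, right))
print(interaural_time_difference(left, right, sample_rate=16000))
```

A broadband test signal is used here because the cross-correlation of a pure tone is periodic, which would make the peak lag ambiguous.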

Abstract

Recently, diffusion models have achieved great success in mono-channel audio generation. However, in stereo audio generation, soundscapes often contain complex scenes with multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, we have also acquired a set of images and reasonably paired stereo audio clips through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to derive reasonable spatial guidance. By leveraging spatial guidance, our model not only generates immersive and controllable spatial audio from text but also extends to other modalities as a pioneering attempt. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.

Paper Structure

This paper contains 59 sections, 15 equations, 17 figures, 33 tables.

Figures (17)

  • Figure 1: Our SpatialSonic, as a one-stage model, alleviates the error accumulation of two-stage models and facilitates control through end-to-end finetuning. Moreover, our spatially enhanced system supports spatial audio generation from text and image, as well as interactive actions.
  • Figure 2: The pipeline of BEWO-1M data collection. The data machine is driven by LLM induction and rigorous simulation. In particular, the data for testing are built with human checking. The diagram in Step 3 represents one of the simulation scenarios. (a) illustrates the diversity of source and microphone positions. (b-f) show the abundant soundscapes in BEWO-1M.
  • Figure 3: In polar coordinates, the radius represents normalized audio energy, the angle denotes the perception angle (0° for right, 180° for left), and the five colors in the legend correspond to common directional terms used to describe sound events. (a) is the human perception based on a questionnaire answered by volunteers. In (b), the baseline fails to generate controllable audio. (c) highlights the valuable knowledge from BEWO-1M. (d,e) highlight the superiority of our data and methods in controlling the generation of 5 common directions and uniform fine-controlling matrices.
  • Figure 4: The overall pipeline of SpatialSonic. It is a one-stage controllable model that processes multimodal inputs to generate spatial audio, where GPT is used to inject the specific azimuth state into the guidance.
  • Figure C5: Jaccard similarity between raw descriptions and our rewritten captions in the single static subset. We follow Mei_2024 to conduct this analysis to show a generally low level of lexical overlap across various sources.
  • ...and 12 more figures
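
The Jaccard similarity used in Figure C5 is the size of the intersection of two sets divided by the size of their union. The following is a minimal sketch of a word-level version for comparing a raw description with a rewritten caption. The tokenization (lowercased alphanumeric words), the function names, and the example captions are assumptions for illustration; the preprocessing actually used in the analysis is not specified here.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(text_a, text_b):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # assumed convention when both texts are empty
    return len(tokens_a & tokens_b) / len(union)

# Hypothetical raw description and rewritten caption.
raw = "A dog barks"
rewritten = "A dog is barking on the left"
print(jaccard_similarity(raw, rewritten))
```

A low score indicates little lexical overlap, meaning the rewritten caption shares few words with the raw description.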