DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes

Haowen Li, Zhengding Luo, Dongyuan Shi, Boxiang Wang, Junwei Ji, Ziyi Yang, Woon-Seng Gan

TL;DR

This work tackles robust DOA estimation under diverse, realistic conditions by leveraging the BEWO dataset, an LLM-assisted synthetic spatial-audio corpus, to overcome generalization gaps of traditional RIR-based data. It introduces LightDOA, a lightweight IPD-based network with depthwise separable convolutions and a GRU backend that classifies azimuth on a grid from $0^\circ$ to $180^\circ$ with $5^\circ$ spacing, achieving competitive accuracy with significantly fewer parameters (about $3.9\times 10^{4}$). The method outperforms larger baselines across resolutions, highlighting the efficacy of compact spatial modeling and the value of diverse synthetic data for robust DOA learning. These results support real-time, edge-friendly DOA deployment and suggest that LLM-generated spatial data can meaningfully advance robust, efficient spatial audio processing.
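
The grid-based azimuth classification mentioned in the TL;DR can be illustrated with a small sketch. It assumes a $0^\circ$ to $180^\circ$ range with a configurable step (5 degrees by default). The helper names `azimuth_grid`, `angle_to_class`, and `class_to_angle` are hypothetical and are not taken from the paper's code.

```python
import math


def azimuth_grid(start_deg=0.0, stop_deg=180.0, step_deg=5.0):
    """Return the candidate azimuths (class centers) in degrees."""
    n_classes = int(round((stop_deg - start_deg) / step_deg)) + 1
    return [start_deg + i * step_deg for i in range(n_classes)]


def angle_to_class(angle_deg, start_deg=0.0, stop_deg=180.0, step_deg=5.0):
    """Map a continuous azimuth to the index of the nearest grid point.

    Angles outside the range are clipped to the grid ends (an assumption).
    """
    clipped = min(max(angle_deg, start_deg), stop_deg)
    # floor(x + 0.5) rounds halves upward, avoiding Python's banker's rounding.
    return int(math.floor((clipped - start_deg) / step_deg + 0.5))


def class_to_angle(class_idx, start_deg=0.0, step_deg=5.0):
    """Map a class index back to its azimuth in degrees."""
    return start_deg + class_idx * step_deg


if __name__ == "__main__":
    print(len(azimuth_grid()))            # 37 classes at 5-degree spacing
    idx = angle_to_class(92.0)
    print(idx, class_to_angle(idx))       # 18 90.0
```

Keeping the step as a parameter makes it easy to evaluate the same labels at coarser or finer resolutions, for example `azimuth_grid(step_deg=10.0)` yields 19 classes.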

Abstract

Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing, with wide-ranging real-world applications. Most existing DOA models are trained on synthetic data generated by convolving clean speech with room impulse responses (RIRs), which limits their generalizability due to constrained acoustic diversity. In this paper, we revisit DOA estimation using a recently introduced dataset constructed with the assistance of large language models (LLMs), which provides more realistic and diverse spatial audio scenes. We benchmark several representative neural network-based DOA methods on this dataset and propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions, specifically designed for multi-channel input in varying environments. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes while maintaining low computational complexity. This study not only highlights the potential of spatial audio synthesized with the assistance of LLMs in advancing robust and efficient DOA estimation research, but also establishes LightDOA as an efficient solution for resource-constrained applications.
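
To give a sense of why depthwise separable convolutions keep a model small, the sketch below compares parameter counts for a standard 2-D convolution and its depthwise separable counterpart (a per-channel depthwise convolution followed by a 1x1 pointwise convolution). The channel sizes and kernel size in the example are hypothetical and are not LightDOA's actual configuration.

```python
def standard_conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=True):
    """Parameters of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)    # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)   # 1 x 1 conv mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 32 input channels, 64 output channels, 3 x 3 kernel.
    print(standard_conv_params(32, 64, 3))        # 18496
    print(depthwise_separable_params(32, 64, 3))  # 2432
```

For this hypothetical layer the separable form needs roughly 7.6 times fewer parameters, which is the kind of saving that makes very compact networks feasible.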

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Pipeline for Constructing the SS Subset from AudioCaps.
  • Figure 2: Angle distribution of the BEWO dataset. (a) shows the kernel density estimation (KDE) of raw DOA angles in the training set, where sparse regions (e.g., around 22°, 67°, 112°, and 157°) are visually identifiable. (b) shows the histogram of DOA angles after applying the front–back mapping across training, validation, and test sets, which are globally consistent but imbalanced within each set.
  • Figure 3: Overview of the proposed LightDOA architecture. The input IPD feature is processed by a stack of depthwise separable convolutional blocks, followed by temporal modeling and classification. The region enclosed in the orange dashed line indicates the overall architecture of the proposed LightDOA network. The blue dashed box illustrates the internal structure of a single depthwise separable convolutional block.
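
Figure 3 names the IPD feature as the network input. The following minimal sketch shows one way an inter-channel phase difference can be computed per frequency bin. It assumes complex STFT values for two channels are already available, and it uses a cos/sin encoding of the phase, which is a common choice but is not confirmed by this section. The function names `ipd` and `ipd_features` are hypothetical.

```python
import cmath
import math


def ipd(ref_bin, other_bin):
    """Phase difference (radians, in [-pi, pi]) between two complex STFT bins."""
    return cmath.phase(ref_bin * other_bin.conjugate())


def ipd_features(ref_frame, other_frame):
    """Encode the IPD of each frequency bin as a (cos, sin) pair.

    The cos/sin encoding avoids the discontinuity at +/- pi (an assumed
    design choice, not taken from the paper).
    """
    feats = []
    for r, o in zip(ref_frame, other_frame):
        phi = ipd(r, o)
        feats.append((math.cos(phi), math.sin(phi)))
    return feats


if __name__ == "__main__":
    ref = 1 + 0j
    delayed = ref * cmath.exp(-0.5j)   # second channel lags by 0.5 rad at this bin
    print(round(ipd(ref, delayed), 6))  # 0.5
```

A pure delay between the two microphones appears as a phase difference that grows with frequency, which is the cue a DOA classifier can learn to map to an azimuth class.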