Sound event localization and classification using WASN in Outdoor Environment

Dongzhe Zhang, Jianfeng Chen, Jisheng Bai, Mou Wang, Dongyuan Shi, Qixiang Niu, Alberto Bernardini

TL;DR

The paper tackles outdoor sound event localization and classification (SELC) with a multi-array wireless acoustic sensor network (WASN) and a multitask CNN-Transformer model that fuses Soundmap, GTGram, and array-coordinate features. It introduces the Soundmap and GTGram representations and a joint loss that estimates location and class simultaneously, achieving state-of-the-art SELC performance in simulated and real-world outdoor environments. The approach remains robust across varying noise levels and array configurations, and it supports practical edge-computing deployment with synchronized timing. This work advances scalable, accurate, and robust outdoor acoustic sensing for applications such as wildlife monitoring and public safety.

Abstract

Deep learning-based sound event localization and classification is an emerging research area within wireless acoustic sensor networks. However, current methods for sound event localization and classification typically rely on a single microphone array, making them susceptible to signal attenuation and environmental noise, which limits their monitoring range. Moreover, methods using multiple microphone arrays often focus solely on source localization and neglect sound event classification. In this paper, we propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of a sound source. We introduce a Soundmap feature to capture spatial information across multiple frequency bands. We also use the Gammatone filter to generate acoustic features more suitable for outdoor environments. Furthermore, we integrate attention mechanisms to learn channel-wise relationships and temporal dependencies within the acoustic features. To evaluate our proposed method, we conduct experiments using simulated datasets with different noise levels and monitoring-area sizes, as well as different array and source positions. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods in both sound event classification and sound source localization tasks. We also provide further analysis to explain the reasons for the observed errors.
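
As a rough illustration of the Gammatone front end mentioned in the abstract, the sketch below places filter center frequencies at equal steps on the ERB-rate scale, using the widely used Glasberg-Moore formulas. The frequency range, the number of bands, and every function name here are assumptions made for the example; the paper's actual GTGram settings are not given in this summary.

```python
import numpy as np


def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the Glasberg-Moore ERB-rate scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f_hz, dtype=float))


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(erb_rate, dtype=float) / 21.4) - 1.0) / 0.00437


def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter at f_hz."""
    return 24.7 * (4.37 * np.asarray(f_hz, dtype=float) / 1000.0 + 1.0)


def gammatone_center_freqs(f_min, f_max, n_bands):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb_points = np.linspace(hz_to_erb_rate(f_min), hz_to_erb_rate(f_max), n_bands)
    return erb_rate_to_hz(erb_points)


# Hypothetical settings, chosen only for illustration.
centers = gammatone_center_freqs(f_min=50.0, f_max=8000.0, n_bands=64)
bandwidths = erb_bandwidth(centers)
```

Because the spacing is uniform on the ERB-rate scale, the bands are dense at low frequencies and widen toward high frequencies, which is the usual motivation for Gammatone-based features over a linearly spaced spectrogram.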

Paper Structure (19 sections, 13 equations, 8 figures, 9 tables)

This paper contains 19 sections, 13 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: The WASN system, where A1, A2, and A3 represent array nodes and S represents the target sound source. Array nodes collect and process multi-channel audio signals, extract various features from the signals, and transmit these features to the central node. The central node collects feature data from multiple array nodes and uses a neural network to estimate the class and location of the target source.
  • Figure 2: The model architecture of the proposed method.
  • Figure 3: Comparison of confusion matrices obtained by the proposed method with and without attention mechanisms.
  • Figure 4: A typical case of the system. There are 5 array nodes, with one target sound source ($\textit{siren}$) and one interfering sound source ($\textit{dog bark}$) active simultaneously. The dashed lines on the Soundmap feature represent the estimated DOA of the target sound source relative to the array nodes. A1, A2, A3, and A4 can offer relatively precise estimates of direction (green dashed line), while A5, being too close to the interfering sound source, provides inaccurate direction estimates (red dashed line). The localization results of SSL-PLSE [koks2001passive] and SSL-FUZZY [faraji2019sound] are strongly impacted by A5, with RMSE values of 22.1 m and 13.3 m, respectively. By contrast, DL-based methods exhibit reduced sensitivity to A5: SSL-STFT [le2019learning] and SSL-SOFT [feng2023soft] achieve RMSE of 7.3 m and 7.8 m, and our proposed method obtains a notably lower error of 3.7 m.
  • Figure 5: Training and validation loss curves over 50 epochs, showing rapid improvement in the first 20 epochs followed by gradual convergence.
  • ...and 3 more figures
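
The Figure 4 caption describes how a single misleading DOA estimate (from A5) can pull a geometric localizer away from the true source. The sketch below illustrates that effect with a generic least-squares intersection of bearing lines. It is not the SSL-PLSE implementation from the cited work, and the array positions, source position, and 40-degree bearing error are hypothetical values chosen for the example.

```python
import numpy as np


def locate_from_bearings(array_xy, bearings_rad):
    """Least-squares intersection of 2-D bearing lines.

    Each array at position p_i with bearing theta_i defines the line
    n_i . (x - p_i) = 0, where n_i = (-sin theta_i, cos theta_i).
    Note: this treats bearings as full lines, not one-sided rays.
    """
    p = np.asarray(array_xy, dtype=float)
    theta = np.asarray(bearings_rad, dtype=float)
    normals = np.stack([-np.sin(theta), np.cos(theta)], axis=1)
    rhs = np.sum(normals * p, axis=1)
    estimate, *_ = np.linalg.lstsq(normals, rhs, rcond=None)
    return estimate


def rmse(estimates_xy, truths_xy):
    """Root-mean-square Euclidean distance between estimates and truths."""
    diff = np.atleast_2d(np.asarray(estimates_xy, dtype=float)) - np.atleast_2d(
        np.asarray(truths_xy, dtype=float)
    )
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


# Hypothetical layout (metres): five array nodes and one target source.
arrays = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0], [50.0, -20.0]])
source = np.array([30.0, 40.0])

# Ideal bearings point exactly from each array to the source.
true_bearings = np.arctan2(source[1] - arrays[:, 1], source[0] - arrays[:, 0])
clean_estimate = locate_from_bearings(arrays, true_bearings)

# Corrupt only the last node's bearing, mimicking an array misled by an interferer.
bad_bearings = true_bearings.copy()
bad_bearings[-1] += np.deg2rad(40.0)
skewed_estimate = locate_from_bearings(arrays, bad_bearings)

clean_error = rmse(clean_estimate, source)
skewed_error = rmse(skewed_estimate, source)
```

With exact bearings the estimate coincides with the source, while corrupting one bearing moves the least-squares solution away from it, since every line carries equal weight. This is one plausible reading of why the caption reports larger errors for the geometric baselines than for the learned methods.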