Location-Oriented Sound Event Localization and Detection with Spatial Mapping and Regression Localization

Xueping Zhang, Yaxiong Chen, Ruilin Yao, Yunfei Zi, Shengwu Xiong

TL;DR

This work tackles SELD under unknown maximum polyphony by introducing SMRL-SELD, a location-oriented framework that maps the 3D spatial space to a 2D grid and learns to regularize localization via a regression loss with three components: $L_{Class\text{-}MSE}$, $L_{AIUR}$, and $L_{CL}$. The method employs a CSPDarkNet53-based backbone to produce multi-scale features from FOA input and predicts a per-frame, per-grid, per-class probability map $\hat{y}$, which is then trained with the proposed loss to converge toward true locations. Experiments on STARSS23 and STARSS22 show SMRL-SELD outperforms state-of-the-art SELD methods, especially in polyphonic scenarios, and an ablation study confirms the contribution of each loss term and the benefit of a 15° spatial grid. The approach enhances generality and robustness for real-world, multi-source audio localization and detection tasks with unknown polyphony, offering practical gains for surveillance, biodiversity monitoring, and context-aware devices.

Abstract

Sound Event Localization and Detection (SELD) combines Sound Event Detection (SED) with estimation of the corresponding Direction Of Arrival (DOA). Recently adopted event-oriented multi-track methods generalize poorly in polyphonic environments because the number of tracks is limited. To improve generality in polyphonic environments, we propose Spatial Mapping and Regression Localization for SELD (SMRL-SELD). SMRL-SELD segments the 3D space and maps it to a 2D plane, and a new regression localization loss helps the results converge toward the location of the corresponding event. SMRL-SELD is location-oriented, allowing the model to learn event features based on orientation. The model can therefore process polyphonic sounds regardless of the number of overlapping events. We conducted experiments on the STARSS23 and STARSS22 datasets, where the proposed SMRL-SELD outperforms existing SELD methods both in overall evaluation and in polyphonic environments.
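
To make the spatial-mapping idea concrete, the following is a minimal sketch of how a direction of arrival could be assigned to a cell of a 2D grid and turned into a per-frame, per-grid, per-class target map. It is an illustration under stated assumptions, not the authors' implementation: the azimuth range of [-180°, 180°), the elevation range of [-90°, 90°], the row/column layout, the class count of 13, and the helper names `doa_to_cell` and `frame_target` are all hypothetical. Only the 15° grid resolution comes from the summary above.

```python
import numpy as np

GRID_DEG = 15  # grid resolution reported in the summary; everything else is assumed


def doa_to_cell(azimuth_deg, elevation_deg, grid_deg=GRID_DEG):
    """Map a DOA (degrees) to a (row, col) cell of an assumed elevation x azimuth grid."""
    n_azi = 360 // grid_deg
    n_ele = 180 // grid_deg
    # Shift azimuth from the assumed [-180, 180) range to [0, 360) and wrap around.
    azi = (azimuth_deg + 180.0) % 360.0
    # Clamp elevation to [-90, 90], then shift to [0, 180].
    ele = min(max(elevation_deg, -90.0), 90.0) + 90.0
    col = int(azi // grid_deg) % n_azi
    # An elevation of exactly +90 degrees falls into the last row.
    row = min(int(ele // grid_deg), n_ele - 1)
    return row, col


def frame_target(events, n_classes, grid_deg=GRID_DEG):
    """Build a binary target of shape [n_classes, n_ele, n_azi] for one frame.

    `events` is a list of (class_index, azimuth_deg, elevation_deg) tuples.
    """
    y = np.zeros((n_classes, 180 // grid_deg, 360 // grid_deg), dtype=np.float32)
    for class_idx, azi, ele in events:
        row, col = doa_to_cell(azi, ele, grid_deg)
        y[class_idx, row, col] = 1.0
    return y


# Three simultaneous events, two of them from the same (hypothetical) class 2.
events = [(2, 0.0, 0.0), (2, 100.0, 20.0), (5, -45.0, -10.0)]
y = frame_target(events, n_classes=13)
```

Because each active event occupies its own grid cell, the number of overlapping events in this layout is bounded by the number of cells rather than by a fixed number of output tracks, which is the property the abstract describes as location-oriented. Two events of the same class that fall into the same cell would collide in this sketch.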

Paper Structure

This paper contains 19 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The class and location of events occurring at frame $t$ in a multi-channel acoustic signal in a 3D space. The location is described by azimuth $\phi$ and elevation $\theta$.
  • Figure 2: Schematic of our location-oriented sound event localization and detection method, which comprises three parts: Spatial Mapping, Network Structure, and Localization Regression Loss. $[\cdot , \cdot , ...]$ denotes the shape of the features.
  • Figure 3: The schematic representation depicts the influence of the asymptotic localization loss function on the model's predictions. The arrows in the diagram indicate the directionality of the guidance provided by the loss function, steering the model towards more accurate predictions.