Location-Oriented Sound Event Localization and Detection with Spatial Mapping and Regression Localization
Xueping Zhang, Yaxiong Chen, Ruilin Yao, Yunfei Zi, Shengwu Xiong
TL;DR
This work tackles SELD under unknown maximum polyphony by introducing SMRL-SELD, a location-oriented framework that maps the 3D spatial space to a 2D grid and learns to regularize localization via a regression loss with three components: $L_{Class\text{-}MSE}$, $L_{AIUR}$, and $L_{CL}$. The method employs a CSPDarkNet53-based backbone to produce multi-scale features from FOA input and predicts a per-frame, per-grid, per-class probability map $\hat{y}$, which is then trained with the proposed loss to converge toward true locations. Experiments on STARSS23 and STARSS22 show SMRL-SELD outperforms state-of-the-art SELD methods, especially in polyphonic scenarios, and an ablation study confirms the contribution of each loss term and the benefit of a 15° spatial grid. The approach enhances generality and robustness for real-world, multi-source audio localization and detection tasks with unknown polyphony, offering practical gains for surveillance, biodiversity monitoring, and context-aware devices.
Abstract
Sound Event Localization and Detection (SELD) combines Sound Event Detection (SED) with estimation of the corresponding Direction Of Arrival (DOA). Recently adopted event-oriented multi-track methods generalize poorly in polyphonic environments because the number of tracks is fixed in advance. To improve generality in polyphonic environments, we propose Spatial Mapping and Regression Localization for SELD (SMRL-SELD). SMRL-SELD segments the 3D space and maps it to a 2D plane, and introduces a new regression localization loss that helps predictions converge toward the location of the corresponding event. SMRL-SELD is location-oriented, allowing the model to learn event features based on orientation. The method can therefore process polyphonic sounds regardless of the number of overlapping events. We conducted experiments on the STARSS23 and STARSS22 datasets, and the proposed SMRL-SELD outperforms existing SELD methods both in overall evaluation and in polyphonic environments.
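The spatial mapping idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the 3D space is discretized over azimuth and elevation at the 15° resolution mentioned in the TL;DR, with azimuth in [-180°, 180°) and elevation in [-90°, 90°]. The function names `doa_to_cell` and `build_target` and the binary target encoding are hypothetical choices made for illustration.

```python
import numpy as np

GRID_DEG = 15                 # grid resolution in degrees (value taken from the TL;DR)
N_AZI = 360 // GRID_DEG       # 24 azimuth columns (assumed layout)
N_ELE = 180 // GRID_DEG       # 12 elevation rows (assumed layout)


def doa_to_cell(azimuth_deg, elevation_deg):
    """Map a direction of arrival in degrees to a (row, col) grid cell.

    Assumed convention: azimuth in [-180, 180), elevation in [-90, 90].
    """
    azi = (azimuth_deg + 180.0) % 360.0                 # wrap into [0, 360)
    ele = min(max(elevation_deg, -90.0), 90.0) + 90.0   # clamp, shift into [0, 180]
    col = int(azi // GRID_DEG)
    row = min(int(ele // GRID_DEG), N_ELE - 1)          # elevation +90 falls in the last row
    return row, col


def build_target(events, n_classes):
    """Build a per-grid, per-class target map for a single frame.

    events: iterable of (class_index, azimuth_deg, elevation_deg).
    Hypothetical encoding: 1.0 where an event of that class is active.
    """
    y = np.zeros((N_ELE, N_AZI, n_classes), dtype=np.float32)
    for cls, azi, ele in events:
        row, col = doa_to_cell(azi, ele)
        y[row, col, cls] = 1.0
    return y
```

Because the target is indexed by location rather than by track, any number of simultaneous events, including several of the same class at different directions, fits in one map without fixing a maximum polyphony beforehand.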
