MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection
Da Mu, Zhicheng Zhang, Haobo Yue
TL;DR
This paper addresses Sound Event Localization and Detection (SELD) by integrating a Multi-scale Feature Fusion (MFF) module into the Event-Independent Network V2 (EINV2) to jointly exploit spectral, spatial, and temporal cues. The MFF module uses three parallel subnetworks to generate multi-scale spectral and spatial features and a TF-Convolution Module (TFCM) to capture multi-scale temporal dynamics, with repeated fusion across subnetworks to enrich the representations. The resulting MFF-EINV2 achieves state-of-the-art SELD performance on the STARSS22 and STARSS23 benchmarks. Relative to EINV2, it reduces the parameter count by 68.5% and improves SELD_score by 18.2%, without data augmentation. The approach offers a parameter-efficient framework for SELD on multichannel recordings, with relevance to real-world acoustic scene understanding and direction-of-arrival (DoA) estimation.
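To make the two relative figures above concrete, the sketch below shows how an aggregate SELD score and a relative change can be computed. The aggregate follows the DCASE Task 3 convention of averaging four normalized error terms. Reading the 18.2% figure as a relative decrease of this lower-is-better score is an assumption, and the metric values in the example are hypothetical placeholders, not results from the paper.

```python
def seld_score(er, f_score, le_deg, lr):
    """Aggregate DCASE-style SELD score; lower is better.

    er      -- location-dependent error rate (lower is better)
    f_score -- location-dependent F-score in [0, 1] (higher is better)
    le_deg  -- class-dependent localization error in degrees, in [0, 180]
    lr      -- class-dependent localization recall in [0, 1]
    """
    return (er + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr)) / 4.0


def relative_reduction(baseline, proposed):
    """Fractional decrease of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# Hypothetical metric values for illustration only (NOT from the paper).
baseline = seld_score(er=0.70, f_score=0.30, le_deg=27.0, lr=0.50)
proposed = seld_score(er=0.60, f_score=0.40, le_deg=18.0, lr=0.60)

print(f"baseline SELD score: {baseline:.4f}")
print(f"proposed SELD score: {proposed:.4f}")
print(f"relative improvement: {relative_reduction(baseline, proposed):.1%}")
```

The same `relative_reduction` helper applies to parameter counts: a model with 31.5% of the baseline's parameters corresponds to a 68.5% reduction.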
Abstract
Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD. However, it still struggles to extract features effectively across the spectral, spatial, and temporal domains. This paper proposes a three-stage network structure, the Multi-scale Feature Fusion (MFF) module, to fully extract multi-scale features across these three domains. The MFF module uses a parallel-subnetwork architecture to generate multi-scale spectral and spatial features, and employs the TF-Convolution Module (TFCM) to provide multi-scale temporal features. We incorporate MFF into EINV2 and term the resulting method MFF-EINV2. Experimental results on the DCASE 2022 and 2023 Challenge Task 3 datasets demonstrate the effectiveness of MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared with published methods.
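The general idea of parallel multi-resolution branches with cross-branch fusion and dilated temporal convolutions can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The single-channel input, the frequency pooling factors, the fixed smoothing kernel, the dilation rates, and all function names (`pool_freq`, `temporal_block`, `fuse`, `mff_sketch`) are assumptions chosen for illustration; the actual MFF module operates on learned multichannel feature maps, which is where its spatial features come from.

```python
def pool_freq(x, factor):
    """Average-pool each frame along frequency by an integer factor."""
    return [[sum(frame[i:i + factor]) / factor
             for i in range(0, len(frame), factor)] for frame in x]


def upsample_freq(x, factor):
    """Nearest-neighbour upsampling along frequency."""
    return [[v for v in frame for _ in range(factor)] for frame in x]


def resample_freq(x, dst_bins):
    """Match a branch to `dst_bins` bins (assumes integer resolution ratios)."""
    src_bins = len(x[0])
    if dst_bins < src_bins:
        return pool_freq(x, src_bins // dst_bins)
    if dst_bins > src_bins:
        return upsample_freq(x, dst_bins // src_bins)
    return [frame[:] for frame in x]


def dilated_time_conv(x, kernel, dilation):
    """Zero-padded 'same' 1-D convolution along time, shared across bins."""
    n_frames, n_bins = len(x), len(x[0])
    half = len(kernel) // 2
    out = [[0.0] * n_bins for _ in range(n_frames)]
    for t in range(n_frames):
        for k, w in enumerate(kernel):
            src = t + (k - half) * dilation
            if 0 <= src < n_frames:
                for f in range(n_bins):
                    out[t][f] += w * x[src][f]
    return out


def temporal_block(x, kernel=(0.25, 0.5, 0.25), dilations=(1, 2, 4)):
    """Residual stack of dilated time convolutions (hypothetical settings)."""
    for d in dilations:
        y = dilated_time_conv(x, kernel, d)
        x = [[a + b for a, b in zip(fx, fy)] for fx, fy in zip(x, y)]
    return x


def fuse(branches):
    """Give every branch the sum of all branches at its own resolution."""
    fused = []
    for target in branches:
        dst = len(target[0])
        acc = [[0.0] * dst for _ in target]
        for src in branches:
            r = resample_freq(src, dst)
            acc = [[a + b for a, b in zip(fa, fr)] for fa, fr in zip(acc, r)]
        fused.append(acc)
    return fused


def mff_sketch(x, scales=(1, 2, 4)):
    """Three parallel branches: pool in frequency, model time, then fuse."""
    branches = [pool_freq(x, s) for s in scales]
    branches = [temporal_block(b) for b in branches]
    return fuse(branches)


# Toy time-frequency map: 6 frames x 8 frequency bins.
x = [[float(t + f) for f in range(8)] for t in range(6)]
out = mff_sketch(x)
print([(len(b), len(b[0])) for b in out])
```

Each branch keeps the full time axis while working at a different frequency resolution, and the fusion step lets every branch see information from the others. In a trained network the fixed kernel and the summation would be replaced by learned convolutions.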
