MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

Da Mu; Zhicheng Zhang; Haobo Yue

TL;DR

This paper addresses sound event localization and detection (SELD) by integrating a Multi-scale Feature Fusion (MFF) module into the Event-Independent Network V2 (EINV2) to jointly exploit spectral, spatial, and temporal cues. The MFF module uses three parallel subnetworks to generate multi-scale spectral and spatial features and a TF-Convolution Module (TFCM) to capture multi-scale temporal dynamics, with repeated fusion across subnetworks to enrich the representations. The resulting MFF-EINV2 achieves state-of-the-art SELD performance on the STARSS22 and STARSS23 benchmarks. Relative to EINV2, and without data augmentation, it reduces the parameter count by 68.5% and improves the SELD score by 18.2%. The result is a more parameter-efficient framework for SELD and direction-of-arrival (DoA) estimation from multichannel recordings.

Abstract

Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD. However, it still struggles to extract features effectively across the spectral, spatial, and temporal domains. This paper proposes a three-stage network structure, the Multi-scale Feature Fusion (MFF) module, to fully extract multi-scale features across these three domains. The MFF module uses a parallel subnetwork architecture to generate multi-scale spectral and spatial features, and employs the TF-Convolution Module (TFCM) to provide multi-scale temporal features. We incorporate MFF into EINV2 and term the resulting method MFF-EINV2. Experimental results on the 2022 and 2023 DCASE Challenge Task 3 datasets show the effectiveness of MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared to published methods.

Paper Structure (13 sections, 1 equation, 2 figures, 3 tables)

Introduction
Proposed Method
Parallel Multi-resolution Subnetworks
TF-Convolution Module
Repeated Multi-scale Fusion
Experiments
Datasets
Hyper-parameters and Evaluation Metrics
Experimental Results
Comparison with other methods
Number of parallel subnetworks
Number of convolutional blocks in TFCM
Conclusion

Figures (2)

Figure 1: An illustration of (a) the MFF-EINV2 pipeline diagram and (b) the details of the MFF module. (a) The architecture of the Conformer blocks and FC layer follows EINV2, and the green boxes indicate soft parameter sharing. (b) C, T, and F are the dimension sizes of channel, time, and frequency, respectively. “1×”, “4×”, and “16×” represent different scales of feature maps. “freq down.” and “freq up.” refer to frequency downsampling (FD) and frequency upsampling (FU), respectively.
Figure 2: An illustration of the convolution operation on (a) a high-resolution frequency feature map from the first subnetwork and (b) a low-resolution frequency feature map from the second subnetwork. The pink boxes indicate the convolution kernel of D-Conv in TFCM. A green time-frequency (T-F) bin contains the information of seven consecutive blue T-F bins.
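
The frequency downsampling (FD), frequency upsampling (FU), and cross-scale fusion named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: average pooling stands in for FD, nearest-neighbour repetition stands in for FU, fusion is plain summation, all branches share the same channel count, and adjacent scales differ by a factor of 4 along frequency. The function names `freq_down`, `freq_up`, and `fuse` are hypothetical.

```python
import numpy as np


def freq_down(x, factor):
    """Stand-in for FD: average-pool the frequency axis of a (C, T, F) map."""
    c, t, f = x.shape
    assert f % factor == 0, "F must be divisible by the pooling factor"
    return x.reshape(c, t, f // factor, factor).mean(axis=-1)


def freq_up(x, factor):
    """Stand-in for FU: repeat each frequency bin `factor` times."""
    return np.repeat(x, factor, axis=-1)


def fuse(branches):
    """Sum all branches at every resolution (assumed fusion rule).

    `branches` is a list of (C, T, F_i) arrays whose frequency sizes are
    integer multiples of one another. The output has the same shapes.
    """
    fused = []
    for target in branches:
        f_target = target.shape[-1]
        total = np.zeros_like(target)
        for src in branches:
            f_src = src.shape[-1]
            if f_src > f_target:
                total += freq_down(src, f_src // f_target)
            elif f_src < f_target:
                total += freq_up(src, f_target // f_src)
            else:
                total += src
        fused.append(total)
    return fused


# Three scales: full frequency resolution, 1/4, and 1/16 of it.
x = np.ones((2, 3, 16))  # (C, T, F)
branches = [x, freq_down(x, 4), freq_down(x, 16)]
fused = fuse(branches)
print([b.shape for b in fused])
```

Each fused branch keeps its own frequency resolution while receiving information from the other two, which is the general idea behind repeated multi-scale fusion; the learned operators in the actual model may differ from these fixed stand-ins.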
