MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

Da Mu, Zhicheng Zhang, Haobo Yue

TL;DR

This paper addresses Sound Event Localization and Detection (SELD) by integrating a Multi-scale Feature Fusion (MFF) module into the Event-Independent Network V2 (EINV2) to jointly exploit spectral, spatial, and temporal cues. The MFF module uses three parallel subnetworks to generate multi-scale spectral and spatial features and a TF-Convolution Module (TFCM) to capture multi-scale temporal dynamics, with repeated fusion across subnetworks to enrich the representations. Without data augmentation, the resulting MFF-EINV2 achieves state-of-the-art SELD performance on the STARSS22 and STARSS23 benchmarks while reducing model parameters by 68.5% and improving SELD_score by 18.2% relative to EINV2. The result is a more parameter-efficient framework for SELD on multichannel recordings, relevant to real-world acoustic scene understanding and direction-of-arrival (DoA) estimation.
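To make the relative figures above concrete, the following is a minimal sketch of how an aggregate SELD_score and a relative reduction can be computed. It assumes the aggregate commonly used in recent DCASE Task 3 evaluations (the mean of error rate, 1 − F-score, localization error divided by 180°, and 1 − localization recall; lower is better); this summary does not spell out the formula. The function names and all metric values are hypothetical and are not results from the paper.

```python
def seld_score(er, f_score, le_deg, lr):
    """Aggregate SELD error (lower is better).

    Assumed definition: mean of error rate, 1 - F-score,
    localization error / 180 degrees, and 1 - localization recall.
    """
    return (er + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr)) / 4.0


def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline


# Hypothetical metric values for illustration only, NOT taken from the paper.
baseline = seld_score(er=0.60, f_score=0.40, le_deg=27.0, lr=0.60)
proposed = seld_score(er=0.50, f_score=0.50, le_deg=18.0, lr=0.70)

print(f"baseline SELD_score: {baseline:.4f}")
print(f"proposed SELD_score: {proposed:.4f}")
print(f"relative improvement: {relative_reduction(baseline, proposed):.1%}")
```

The same `relative_reduction` helper applies to parameter counts: a 68.5% reduction means the proposed model keeps 31.5% of the baseline's parameters.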

Abstract

Sound Event Localization and Detection (SELD) involves detecting and localizing sound events in multichannel recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD, but it still struggles to extract features effectively across the spectral, spatial, and temporal domains. This paper proposes a three-stage network structure, the Multi-scale Feature Fusion (MFF) module, to extract multi-scale features across all three domains. The MFF module uses a parallel-subnetwork architecture to generate multi-scale spectral and spatial features, and employs the TF-Convolution Module (TFCM) to provide multi-scale temporal features. We incorporate MFF into EINV2 and term the resulting method MFF-EINV2. Experimental results on the DCASE 2022 and 2023 Challenge Task 3 datasets show the effectiveness of MFF-EINV2, which achieves state-of-the-art (SOTA) performance compared with published methods.
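The idea of parallel subnetworks that exchange multi-scale features can be illustrated with a small, dependency-free sketch. It operates on a single frequency vector and uses average pooling for frequency downsampling and nearest-neighbour repetition for frequency upsampling; both are simple stand-ins chosen for illustration, not the learned operators of the actual MFF module. The downsampling factors 1, 4, and 16 mirror the scales named in the figure captions, and all function names are hypothetical.

```python
def freq_down(x, factor):
    """Average-pool a frequency vector by `factor` (stand-in for learned FD)."""
    assert len(x) % factor == 0, "length must be divisible by the factor"
    return [sum(x[i:i + factor]) / factor for i in range(0, len(x), factor)]


def freq_up(x, factor):
    """Nearest-neighbour upsampling along frequency (stand-in for learned FU)."""
    return [v for v in x for _ in range(factor)]


def resample(x, src_scale, dst_scale):
    """Bring a branch at downsampling factor `src_scale` to `dst_scale`."""
    if dst_scale > src_scale:
        return freq_down(x, dst_scale // src_scale)
    if dst_scale < src_scale:
        return freq_up(x, src_scale // dst_scale)
    return list(x)


def fuse(branches, scales):
    """For every branch, sum all branches after aligning them to its resolution."""
    fused = []
    for dst in scales:
        aligned = [resample(x, src, dst) for x, src in zip(branches, scales)]
        fused.append([sum(vals) for vals in zip(*aligned)])
    return fused


scales = [1, 4, 16]                       # downsampling factor of each branch
spectrum = [float(i) for i in range(16)]  # toy 16-bin frequency vector
branches = [resample(spectrum, 1, s) for s in scales]
fused = fuse(branches, scales)
print([len(b) for b in fused])
```

After fusion each branch keeps its own resolution but carries information from the other two, which is the property the repeated fusion steps are meant to provide.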

Paper Structure (13 sections, 1 equation, 2 figures, 3 tables)

This paper contains 13 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An illustration of (a) the MFF-EINV2 pipeline and (b) the details of the MFF module. (a) The Conformer blocks and FC layer follow EINV2; the green boxes indicate soft parameter sharing. (b) C, T, and F are the sizes of the channel, time, and frequency dimensions, respectively. “1×”, “4×”, and “16×” denote the different scales of the feature maps. “freq down.” and “freq up.” refer to frequency downsampling (FD) and frequency upsampling (FU), respectively.
  • Figure 2: An illustration of the convolution operation on (a) a high-resolution frequency feature map from the first subnetwork and (b) a low-resolution frequency feature map from the second subnetwork. The pink boxes indicate the convolution kernel of D-Conv in TFCM. A green time-frequency (T-F) bin contains the information of seven consecutive blue T-F bins.
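The effect described in the Figure 2 caption, where the same kernel covers a wider frequency range on a low-resolution map, can be checked with the standard receptive-field recursion. In the sketch below, the downsampling kernel width of 7 follows from the "seven consecutive" bins in the caption; the stride of 4 (taken from the "4×" scale) and the frequency kernel width of 3 for the subsequent convolution are assumptions made for illustration, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """Receptive field (in input bins) of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap adds `jump` input bins
        jump *= stride             # spacing between adjacent outputs grows
    return rf


conv = (3, 1)       # assumed frequency kernel of the convolution, stride 1
freq_down = (7, 4)  # assumed frequency downsampling: kernel 7, stride 4

high_res = receptive_field([conv])             # conv on the 1x map
low_res = receptive_field([freq_down, conv])   # conv on the 4x map
print(high_res, low_res)
```

Under these assumptions, one output bin of the convolution sees 3 original frequency bins on the high-resolution map and 15 on the low-resolution map.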