
MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Zehao Wang, Haobo Yue, Zhicheng Zhang, Da Mu, Jin Tang, Jianqin Yin

TL;DR

A novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED) is introduced, which employs the Mutual-Assistance Audio Adapter to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion module to tackle the multi-granularity problem.

Abstract

Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities. However, they struggle to learn features of complex scenes from heterogeneous datasets. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects the BEATs and CNN branches, which facilitates the deep fusion of information from the two branches. Experimental results show that the proposed methods exceed the baseline mpAUC by 5% on the DESED and MAESTRO Real datasets. Code is available at https://github.com/Visitor-W/MTDA.
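
The abstract describes M3A only as an adapter placed inside the BEATs block for fine-tuning, so the following is a minimal sketch of the generic adapter-tuning idea, not of M3A itself. It assumes a frozen feed-forward layer with a small trainable bottleneck branch added in parallel, and an up-projection initialized to zero so that the adapted block initially reproduces the frozen one. The names `FrozenFFN`, `BottleneckAdapter`, and `adapted_ffn`, as well as all dimensions, are hypothetical and not taken from the paper or its repository.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenFFN:
    """Stand-in for a pretrained feed-forward layer whose weights stay fixed."""

    def __init__(self, dim, hidden, rng):
        self.w1 = rand_matrix(hidden, dim, rng)
        self.w2 = rand_matrix(dim, hidden, rng)

    def __call__(self, x):
        return matvec(self.w2, relu(matvec(self.w1, x)))


class BottleneckAdapter:
    """Generic down-project / ReLU / up-project adapter (assumed design)."""

    def __init__(self, dim, bottleneck, rng):
        self.down = rand_matrix(bottleneck, dim, rng)
        # Zero-initialized up-projection: the adapter contributes nothing
        # until fine-tuning updates these weights.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, relu(matvec(self.down, x)))

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def adapted_ffn(x, ffn, adapter):
    """Residual input + frozen FFN output + parallel adapter output."""
    return [xi + fi + ai for xi, fi, ai in zip(x, ffn(x), adapter(x))]


rng = random.Random(0)
ffn = FrozenFFN(dim=8, hidden=32, rng=rng)
adapter = BottleneckAdapter(dim=8, bottleneck=2, rng=rng)
frame = [rng.uniform(-1.0, 1.0) for _ in range(8)]
output = adapted_ffn(frame, ffn, adapter)
```

In this sketch only the adapter's weights would be updated during fine-tuning, which keeps the number of trainable parameters small relative to the frozen layer (32 versus 512 for the sizes chosen here).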

Paper Structure

This paper contains 13 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Architecture Overview of the MTDA-HSED: We provide a comprehensive visualization of the MTDA-HSED, divided into two main components. Fig. 1(a) illustrates the M3A module, showcasing the integration of the Long-Term Audio Adapter and the Short-Term Audio Adapter within the BEATs block’s FFN. Fig. 1(b) depicts the dual-branch pipeline comprising the BEATs, CRNN, and DBMF module.
  • Figure 2: Details of the DBMF Module
  • Figure 3: Visualization of the M3A modules: The first row is the spectrogram of the input. The second row is the output of the Long-Term Audio Adapter. The third row is the output of the Short-Term Audio Adapter.