MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Zehao Wang; Haobo Yue; Zhicheng Zhang; Da Mu; Jin Tang; Jianqin Yin

TL;DR

MTDA-HSED (Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection) is a dual-branch architecture that uses the Mutual-Assistance Audio Adapter to tackle the multi-scenario problem and the Dual-Branch Mid-Fusion module to tackle the multi-granularity problem.

Abstract

Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities, but they are deficient in learning features of complex scenes from heterogeneous datasets. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects the BEATs and CNN branches, which facilitates deep fusion of the information from the two branches. Experimental results show that the proposed methods exceed the baseline mpAUC by 5% on the DESED and MAESTRO Real datasets. Code is available at https://github.com/Visitor-W/MTDA.

Paper Structure (13 sections, 10 equations, 3 figures, 4 tables)

This paper contains 13 sections, 10 equations, 3 figures, 4 tables.

Introduction
Methodology
Mutual-Assistance Audio Adapter
Dual-Branch Mid-Fusion Module
Experiment
Implementation Details
Experiment Results
Performance Comparison
Ablation Study
Impact of audio adapter number and projection dimension in M3A
Impact of aggregate strategy
Qualitative Study
Conclusion

Figures (3)

Figure 1: Architecture overview of MTDA-HSED, divided into two main components. Fig. 1(a) illustrates the M3A module, showing how the Long-Term Audio Adapter and the Short-Term Audio Adapter are integrated within the FFN of the BEATs block. Fig. 1(b) depicts the dual-branch pipeline comprising BEATs, the CRNN, and the DBMF module.
Figure 2: Details of the DBMF Module
Figure 3: Visualization of the M3A modules. The first row is the spectrogram of the input, the second row is the output of the Long-Term Audio Adapter, and the third row is the output of the Short-Term Audio Adapter.
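
Figure 1(a) places the adapters inside the FFN of the BEATs block, but this page does not describe their internals. As a rough illustration of the general idea of adapter tuning only, the following sketch shows a generic residual bottleneck adapter running in parallel with a frozen feed-forward layer. It is not the M3A module: the class `BottleneckAdapter`, the function `ffn_with_adapter`, the parallel placement, the GELU activation, and the zero-initialized up-projection are all assumptions made for this example.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def gelu(v):
    """Tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


class BottleneckAdapter:
    """Hypothetical generic adapter: down-project, activate, up-project.

    This is NOT the paper's M3A; it only illustrates adapter tuning in general.
    """

    def __init__(self, dim, bottleneck, rng):
        # Small random down-projection (bottleneck x dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        # Zero-initialized up-projection (dim x bottleneck), so the adapter
        # contributes nothing before training (an assumed design choice).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [gelu(v) for v in matvec(self.down, x)]
        return matvec(self.up, hidden)


def ffn_with_adapter(x, frozen_ffn, adapter):
    """Add the adapter output to the output of a frozen feed-forward layer."""
    ffn_out = frozen_ffn(x)   # pretrained weights, not updated during tuning
    delta = adapter(x)        # only the adapter's weights would be trained
    return [f + d for f, d in zip(ffn_out, delta)]


# Toy usage: a stand-in "frozen FFN" that simply doubles its input.
def frozen_ffn(x):
    return [2.0 * v for v in x]


rng = random.Random(0)
adapter = BottleneckAdapter(dim=4, bottleneck=2, rng=rng)
x = [0.5, -1.0, 0.25, 2.0]
out = ffn_with_adapter(x, frozen_ffn, adapter)
```

Because the up-projection starts at zero in this sketch, the adapted layer initially reproduces the frozen layer's output exactly, so fine-tuning starts from the pretrained behaviour and only the small adapter matrices need to be learned.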
