FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels
Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das
TL;DR
The paper tackles sound event detection (SED) for DCASE 2024 Task 4, addressing the domain shift caused by heterogeneous training data and potentially missing labels. It proposes a domain-generalized framework that fuses BEATs-based frame embeddings with a frequency-dynamic CRNN (FDY-CRNN), enhanced by frequency-wise MixStyle, an independent dataset-specific loss, and sound event bounding box (SEBB) post-processing. Key contributions are the application of MixStyle to both internal features and mel-spectrograms, the per-dataset loss design, and the SEBB-based event-level post-processing, validated on DESED and MAESTRO Real data with gains in PSDS and mPAUC. The approach shows improved cross-domain generalization, and an ensemble of systems gives the best reported results on the public evaluation set, which is relevant for real-world multi-domain sound event detection.
Abstract
This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and the Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset drawn from two sources, which makes it challenging to achieve good performance because the source of each audio clip is unknown during evaluation. To address this, we propose a sound event detection (SED) method based on domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers (BEATs) and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply MixStyle along the frequency dimension to adapt the mel-spectrograms from different domains. Second, we compute the training loss separately for each dataset, over only the classes that dataset annotates. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes (SEBB) method for post-processing. Our proposed method shows superior macro-average pAUC (mPAUC) and polyphonic sound detection score (PSDS) performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.
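The first two strategies in the abstract can be illustrated with a minimal NumPy sketch. It is not the submitted system: the function names `freq_mixstyle` and `dataset_masked_bce`, the `(batch, channels, freq, time)` tensor layout, the `alpha` value, and the use of binary cross-entropy for every class are assumptions made for illustration. The sketch assumes frequency-wise MixStyle computes one mean and standard deviation per frequency bin and mixes them between clips of a batch, and that the dataset-specific loss masks out classes that a clip's source dataset does not annotate.

```python
import numpy as np


def freq_mixstyle(x, alpha=0.3, eps=1e-6, rng=None):
    """Mix per-frequency-bin statistics between clips in a batch.

    x: array of shape (batch, channels, freq, time). Layout is an assumption.
    alpha: Beta distribution parameter (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = x.shape[0]
    # One mean/std per clip and frequency bin (reduce over channels and time).
    mu = x.mean(axis=(1, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(1, 3), keepdims=True) + eps)
    x_norm = (x - mu) / sigma
    # Per-clip mixing weight and a random partner clip from the same batch.
    lam = rng.beta(alpha, alpha, size=(batch, 1, 1, 1))
    perm = rng.permutation(batch)
    mu_mix = lam * mu + (1.0 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1.0 - lam) * sigma[perm]
    # Re-style the normalized clip with the mixed statistics.
    return x_norm * sigma_mix + mu_mix


def dataset_masked_bce(probs, targets, class_mask, eps=1e-7):
    """Binary cross-entropy averaged only over annotated classes.

    probs, targets: arrays of shape (batch, classes).
    class_mask: same shape, 1.0 where the clip's source dataset annotates
    the class and 0.0 otherwise (hypothetical encoding).
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    bce = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    # Classes outside the mask contribute nothing to the loss.
    return float((bce * class_mask).sum() / max(class_mask.sum(), 1.0))
```

In this sketch, a clip from one dataset never contributes loss for classes that only the other dataset annotates, so a missing label is not treated as a negative label.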
