Description and analysis of novelties introduced in DCASE Task 4 2022 on the baseline system
Francesca Ronchini, Samuele Cornell, Romain Serizel, Nicolas Turpault, Eduardo Fonseca, Daniel P. W. Ellis
TL;DR
The paper addresses sound event detection with temporal localization in domestic environments under heterogeneous, partially labeled data, and introduces three novelties for DCASE Task 4 2022: external datasets, including strongly annotated AudioSet clips; open-source pretrained embeddings; and CodeCarbon-based energy benchmarking. It evaluates these novelties on a CRNN baseline, optionally augmented with either PANNs or AST embeddings, and analyzes their impact on PSDS-1 (the scenario emphasizing temporal localization) and PSDS-2 (the scenario emphasizing class discrimination). Key findings show that real-world strong labels from AudioSet improve PSDS-1, while pretrained embeddings primarily boost PSDS-2, with the choice of embedding layer significantly affecting the gains. Energy consumption differs notably across configurations, and the plain baseline remains highly competitive in overall performance. The work highlights the importance of data sources and energy-aware benchmarking for practical deployment, and suggests directions for fair comparisons and more efficient fusion of pretrained representations in SED systems.
Abstract
The aim of the Detection and Classification of Acoustic Scenes and Events Challenge Task 4 is to evaluate systems for the detection of sound events in domestic environments using a heterogeneous dataset. The systems need to correctly detect the sound events present in a recorded audio clip and localize them in time. This year's task is a follow-up to DCASE 2021 Task 4, with some important novelties. The goal of this paper is to describe and motivate these new additions and to report an analysis of their impact on the baseline system. We introduced three main novelties: the use of external datasets, including recently released strongly annotated clips from AudioSet; the possibility of leveraging pretrained models; and a new energy consumption metric to raise awareness about the ecological impact of training sound event detectors. The results on the baseline system show that leveraging open-source models pretrained on AudioSet improves the results significantly in terms of event classification but not in terms of event segmentation.
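
To make the distinction between detecting an event and localizing it in time concrete, the following minimal sketch converts frame-level activity probabilities for a single class into (onset, offset) pairs in seconds. It is an illustration only and not the baseline's actual post-processing: the function name `frames_to_events`, the threshold of 0.5, and the frame hop `hop_s` are all assumptions chosen for the example.

```python
from typing import List, Tuple


def frames_to_events(
    probs: List[float],
    threshold: float = 0.5,  # assumed decision threshold (hypothetical)
    hop_s: float = 0.1,      # assumed frame hop in seconds (hypothetical)
) -> List[Tuple[float, float]]:
    """Turn per-frame probabilities for one class into (onset, offset) times."""
    events: List[Tuple[float, float]] = []
    onset = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            # An event starts at this frame.
            onset = i
        elif not active and onset is not None:
            # The event ended at the start of this inactive frame.
            events.append((onset * hop_s, i * hop_s))
            onset = None
    if onset is not None:
        # Close an event that is still active at the end of the clip.
        events.append((onset * hop_s, len(probs) * hop_s))
    return events


if __name__ == "__main__":
    clip = [0.1, 0.8, 0.9, 0.2, 0.7]
    print(frames_to_events(clip, threshold=0.5, hop_s=0.5))
```

In this simplified view, event classification depends on whether any frame of a class crosses the threshold, whereas event segmentation depends on where the onsets and offsets fall, which is why the two can improve independently.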
