Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid; Paul Primus; Tobias Morocutti; Jonathan Greif; Gerhard Widmer

TL;DR

This work addresses Sound Event Detection (SED) on the heterogeneous DCASE 2024 Task 4 datasets (DESED and MAESTRO), where labels may be missing. It presents a two-stage fine-tuning procedure for three Audio Spectrogram Transformers (ATST, PaSST, BEATs), complemented by pre-training on the strongly labeled subset of AudioSet and by a second training iteration driven by pseudo-labels. An ensemble of the fine-tuned transformers generates strong pseudo-labels that substantially boost single-model performance, reaching a state-of-the-art PSDS1 of 0.692 on the DESED public evaluation set and improving mpAUC on MAESTRO. The results indicate that training across both datasets, pseudo-labeling, and AudioSet strong pre-training are effective for SED with heterogeneous labels, which is relevant to real-world acoustic event detection systems.

Abstract

This technical report describes the CP-JKU team's submission to Task 4, "Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels", of the DCASE 2024 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both the CRNN and the transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, which substantially boosts single-model performance. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels before fine-tuning them on the Task 4 datasets.
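To make the pseudo-labeling step more concrete, the following is a minimal sketch of how ensemble pseudo-labels and a distillation loss could be computed. It is an illustration under stated assumptions, not the submission's implementation: the abstract does not say how the ensemble is combined or which distillation loss is used, so plain averaging of frame-wise probabilities and a binary cross-entropy against the soft pseudo-labels are assumed here. The function names and the `distill_weight` parameter are hypothetical.

```python
import numpy as np


def ensemble_pseudo_labels(model_probs):
    """Average frame-wise class probabilities from several models.

    model_probs: list of arrays, each of shape (frames, classes) in [0, 1].
    Averaging is an assumption; the report only states that an ensemble is used.
    """
    return np.stack(model_probs, axis=0).mean(axis=0)


def distillation_bce(student_probs, pseudo_labels, eps=1e-7):
    """Binary cross-entropy of student predictions against soft pseudo-labels.

    The choice of BCE is an assumption made for illustration.
    """
    p = np.clip(student_probs, eps, 1.0 - eps)
    loss = -(pseudo_labels * np.log(p) + (1.0 - pseudo_labels) * np.log(1.0 - p))
    return float(loss.mean())


def total_loss(supervised_loss, distill_loss, distill_weight=1.0):
    """Combine a supervised loss with the weighted distillation term.

    distill_weight is a hypothetical hyperparameter.
    """
    return supervised_loss + distill_weight * distill_loss


# Toy example: three "teacher" models, 4 frames, 3 event classes.
rng = np.random.default_rng(0)
teachers = [rng.uniform(size=(4, 3)) for _ in range(3)]
pseudo = ensemble_pseudo_labels(teachers)
student = rng.uniform(size=(4, 3))
print("distillation loss:", distillation_bce(student, pseudo))
```

In the second training iteration, a term of this kind would be added to the existing losses in both stages, so that each single model is pulled toward the ensemble's frame-level predictions.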
