Effective Pre-Training of Audio Transformers for Sound Event Detection
Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer
TL;DR
The paper tackles frame-level sound event detection by improving the pre-training of audio transformers with AudioSet Strong supervision, balanced sampling, and ensemble knowledge distillation. It introduces a three-step pipeline (self-supervised or ImageNet pre-training, fine-tuning on AudioSet Weak, and knowledge distillation on AudioSet Strong) and reports substantial PSDS1 gains across five transformer architectures, with an ensemble reaching around 47.1 PSDS1 on AudioSet Strong. Transferability is evaluated on downstream tasks (DESED, DC16-T2, MAESTRO) under frozen and fine-tuned regimes, with the best results on tasks whose classes align well with the AudioSet ontology. Public checkpoints are released to accelerate research in temporally precise audio understanding and related applications such as audio captioning, grounding, and retrieval.
Abstract
We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
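Two of the ingredients named above, the balanced sampler and ensemble knowledge distillation, can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the inverse-class-frequency weighting, the averaging of teacher probabilities, and the `kd_weight` value are all assumptions chosen for illustration, since this summary does not specify them.

```python
import math
from collections import Counter


def balanced_sampling_weights(clip_labels):
    """Per-clip sampling weights that favor clips containing rare classes.

    clip_labels: list of sets, one set of class names per clip.
    Assumption: weight = sum of inverse class counts over the clip's classes.
    """
    counts = Counter(label for labels in clip_labels for label in labels)
    return [sum(1.0 / counts[label] for label in labels) for labels in clip_labels]


def ensemble_soft_targets(teacher_probs):
    """Average frame-level probabilities over teachers.

    teacher_probs: nested list indexed as [teacher][frame][class].
    Assumption: the ensemble is a plain mean of probabilities.
    """
    n_teachers = len(teacher_probs)
    n_frames = len(teacher_probs[0])
    n_classes = len(teacher_probs[0][0])
    return [
        [sum(t[f][c] for t in teacher_probs) / n_teachers for c in range(n_classes)]
        for f in range(n_frames)
    ]


def binary_cross_entropy(pred, target, eps=1e-7):
    """Binary cross-entropy for one probability; target may be soft (in [0, 1])."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def distillation_loss(student_probs, hard_labels, soft_targets, kd_weight=0.9):
    """Mean over frames and classes of a weighted label loss plus distillation loss.

    kd_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    total, count = 0.0, 0
    for s_frame, y_frame, t_frame in zip(student_probs, hard_labels, soft_targets):
        for s, y, t in zip(s_frame, y_frame, t_frame):
            label_loss = binary_cross_entropy(s, y)  # against frame-level annotations
            kd_loss = binary_cross_entropy(s, t)     # against ensemble predictions
            total += (1.0 - kd_weight) * label_loss + kd_weight * kd_loss
            count += 1
    return total / count
```

The weights returned by `balanced_sampling_weights` could be passed to the standard-library call `random.choices(clips, weights=weights, k=batch_size)` to draw a batch. In a real training setup the same loss would typically be computed on tensors of logits rather than nested lists of probabilities.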
