Effective Pre-Training of Audio Transformers for Sound Event Detection

Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer

TL;DR

The paper tackles frame-level sound event detection by improving the pre-training of audio transformers with AudioSet Strong supervision, balanced sampling, and ensemble knowledge distillation. It introduces a three-step pipeline (self-supervised/ImageNet pre-training, AudioSet Weak fine-tuning, and AudioSet Strong knowledge distillation) and reports substantial PSDS1 gains across five transformer architectures, with an ensemble reaching about 47.1 PSDS1 on AudioSet Strong. Transferability is evaluated on downstream tasks (DESED, DC16-T2, MAESTRO) under frozen and fine-tuned regimes, with the best in-domain performance achieved when the AudioSet ontology aligns well with the task. Public checkpoints are released to accelerate research in temporally precise audio understanding and related applications such as audio captioning, grounding, and retrieval.
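
To make the ensemble knowledge distillation step more concrete, here is a minimal sketch of a frame-level distillation loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the averaging of teacher probabilities (rather than logits), and the `kd_weight` value are all hypothetical choices that this summary does not specify.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one (frame, class) cell; target may be soft."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def ensemble_targets(teacher_logits):
    """Average teacher probabilities per frame and class.

    teacher_logits is indexed as [teacher][frame][class].
    Assumption: probabilities are averaged; averaging logits is another option.
    """
    n_teachers = len(teacher_logits)
    n_frames = len(teacher_logits[0])
    n_classes = len(teacher_logits[0][0])
    return [
        [
            sum(sigmoid(t[f][c]) for t in teacher_logits) / n_teachers
            for c in range(n_classes)
        ]
        for f in range(n_frames)
    ]


def distillation_loss(student_logits, hard_labels, soft_targets, kd_weight=0.9):
    """Weighted sum of a hard-label loss and a soft-target (teacher) loss.

    kd_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    hard_total, soft_total, cells = 0.0, 0.0, 0
    for f, frame in enumerate(student_logits):
        for c, logit in enumerate(frame):
            prob = sigmoid(logit)
            hard_total += bce(prob, hard_labels[f][c])
            soft_total += bce(prob, soft_targets[f][c])
            cells += 1
    return ((1.0 - kd_weight) * hard_total + kd_weight * soft_total) / cells
```

In this sketch the student is trained on every frame against both the strong annotations and the averaged teacher predictions, so the soft targets can carry information about classes that the hard labels mark as absent.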

Abstract

We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
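
The balanced sampler mentioned above can be sketched as follows. This is a hedged illustration of one common scheme (weighting each clip by the summed inverse frequency of its classes), not necessarily the one used in the paper; `clip_sampling_weights` and `sample_balanced` are hypothetical names introduced only for this example.

```python
import random
from collections import Counter


def clip_sampling_weights(clip_labels):
    """Return one sampling weight per clip.

    clip_labels is a list of label lists, one per clip. Each clip is weighted
    by the sum of inverse clip counts of its classes, so clips containing rare
    classes are drawn more often. Assumption: this weighting is illustrative.
    """
    counts = Counter(label for labels in clip_labels for label in set(labels))
    return [
        sum(1.0 / counts[label] for label in set(labels))
        for labels in clip_labels
    ]


def sample_balanced(clip_labels, num_samples, seed=0):
    """Draw clip indices with replacement according to the balanced weights."""
    weights = clip_sampling_weights(clip_labels)
    rng = random.Random(seed)
    return rng.choices(range(len(clip_labels)), weights=weights, k=num_samples)


# Three clips of a frequent class and one clip of a rare class.
clips = [["speech"], ["speech"], ["speech"], ["dog"]]
weights = clip_sampling_weights(clips)
indices = sample_balanced(clips, num_samples=8)
```

With these weights the three "speech" clips together receive the same total sampling mass as the single "dog" clip. Clips without any label get weight zero, and `random.choices` raises a `ValueError` if every weight is zero.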

Paper Structure (18 sections, 3 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Full pre-training pipeline from scratch. Blue blocks stand for supervised training on the specified datasets and orange for self-supervised learning.