MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Pengfei Cai; Yan Song; Kang Li; Haoyu Song; Ian McLoughlin

TL;DR

This paper addresses sound event detection (SED) under limited labeled data by proposing MAT-SED, a pure Transformer-based model that pre-trains its context network with a masked-reconstruction task and uses global-local feature fusion to improve localization. The encoder (PaSST) provides strong latent representations, while the context network uses relative positional encoding to model temporal dependencies. The context network is first pre-trained on unlabeled data; the model is then fine-tuned with the mean-teacher semi-supervised method, which improves robustness and reduces overfitting. Key contributions include the first fully Transformer-based SED model with self-supervised pre-training, an effective masked-reconstruction objective for temporal modeling, and a fusion strategy that integrates global and local cues for precise localization. Empirically, MAT-SED achieves PSDS1 = 0.587 and PSDS2 = 0.896 on DCASE2023 Task 4, surpassing prior methods and underscoring the potential of self-supervised pre-training for audio Transformers.
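The mean-teacher scheme mentioned above keeps a teacher copy of the model whose weights are an exponential moving average (EMA) of the student's weights, and penalizes disagreement between the two on unlabeled clips. The following is a minimal sketch of that general idea, not the paper's implementation: the dictionary-of-arrays weight format, the function names `ema_update` and `consistency_loss`, the decay value, and the squared-error consistency term are all assumptions made for illustration.

```python
import numpy as np

def ema_update(teacher, student, decay):
    """Move each teacher weight toward the matching student weight (EMA step)."""
    for name, student_w in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_w
    return teacher

def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    return float(np.mean((student_probs - teacher_probs) ** 2))

# Toy weights: hypothetical single-parameter "models" for illustration only.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, decay=0.9)  # assumed decay value

# Toy frame-level predictions on an unlabeled clip (10 frames, 4 classes).
rng = np.random.default_rng(0)
student_probs = rng.random((10, 4))
teacher_probs = rng.random((10, 4))
loss = consistency_loss(student_probs, teacher_probs)
```

In a real training loop the student would be updated by gradient descent on the supervised loss plus this consistency term, with `ema_update` called after each step; the teacher receives no gradients.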

Abstract

Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. On DCASE2023 Task 4, MAT-SED surpasses state-of-the-art performance, achieving PSDS1/PSDS2 of 0.587/0.896.
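To make the masked-reconstruction idea concrete, the sketch below hides a subset of latent time frames and scores a reconstruction only on the hidden positions. It is an illustration under stated assumptions, not the authors' code: independent per-frame random masking, the zero-valued mask token, the 0.75 masking ratio, the mean-squared-error loss, and the names `mask_frames` and `reconstruction_loss` are all hypothetical choices, and an identity function stands in for the context network.

```python
import numpy as np

def mask_frames(latents, mask_ratio, mask_token, rng):
    """Replace a random subset of time frames with a mask token.

    latents: (T, D) array standing in for encoder outputs of one clip.
    Returns the masked sequence and a boolean mask of shape (T,).
    """
    num_frames = latents.shape[0]
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)
    mask = np.zeros(num_frames, dtype=bool)
    mask[masked_idx] = True
    masked = latents.copy()
    masked[mask] = mask_token
    return masked, mask

def reconstruction_loss(predicted, target, mask):
    """Mean squared error computed only over the masked frames."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
latents = rng.standard_normal((100, 8))   # toy latents: 100 frames, 8 dims
mask_token = np.zeros(8)                  # assumed fixed token; often learned
masked, mask = mask_frames(latents, 0.75, mask_token, rng)

# Identity stand-in for the context network: it leaves mask tokens untouched,
# so the loss on masked frames is non-zero.
loss_identity = reconstruction_loss(masked, latents, mask)
# A perfect reconstruction would drive the loss to zero.
loss_perfect = reconstruction_loss(latents, latents, mask)
```

Because the loss ignores visible frames, the network can only reduce it by inferring hidden frames from their temporal context, which is the property that makes this a pre-training task for temporal modeling.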

Paper Structure (18 sections, 2 equations, 4 figures, 3 tables)

Introduction
Methodology
Model
Encoder network
Context network
Masked-reconstruction based pre-training
Fine-tuning
Experimental Setup
Dataset
Feature extraction and evaluation setting
Model and training setting
Results
Performance of the proposed methods
Ablation studies
Ablations of the context network
...and 3 more sections

Figures (4)

Figure 1: The architecture of MAT-SED, comprising two main components: the encoder network (green) and the context network (yellow), both of which are based on Transformer structures. "RPE" in the context network indicates the relative positional encoding.
Figure 2: The global-local feature fusion strategy in the fine-tuning stage.
Figure 3: Convergence curves of training MAT-SED from scratch and end-to-end fine-tuning after masked-reconstruction pre-training.
Figure 4: Impact of different masking ratios in the masked-reconstruction pre-training stage.
