Fine-tune the pretrained ATST model for sound event detection

Nian Shao, Xian Li, Xiaofei Li

TL;DR

The paper tackles data scarcity in sound event detection (SED) by fine-tuning a large pretrained self-supervised learning (SelfSL) model, ATST-Frame, within a CRNN-based SED system. It introduces ATST-SED, which replaces BEATs with the frame-level ATST-Frame, and presents a two-stage fine-tuning workflow: the first stage trains the system with ATST-Frame frozen, and the second fine-tunes it with heavily weighted unsupervised losses, mean teacher (MT) and interpolation consistency training (ICT), on in-domain unlabeled data. On the DESED data used for DCASE Task 4, the system reaches state-of-the-art PSDS1/PSDS2 scores of 0.587/0.812 with clear gains over frozen baselines, and ablations confirm the importance of first-stage initialization and of each component (frequency warping, mixup, ICT, MT). The results show the practical value of adapting large SelfSL models to SED and suggest broader applicability to other downstream tasks.

Abstract

Sound event detection (SED) often suffers from data deficiency. The recent baseline system of DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this limitation, as the pretrained models help to produce more discriminative features for SED. However, the challenge baseline system and most of the challenge submissions treat the pretrained models as frozen feature extractors, and fine-tuning of the pretrained models has rarely been studied. In this work, we study fine-tuning of the pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was specifically designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performance on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
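
The two unsupervised objectives used on unlabelled data, mean teacher (MT) and interpolation consistency training (ICT), can be illustrated with a minimal sketch. It follows the general definitions of the two techniques rather than the paper's implementation: predictions are plain Python lists, and the function names, the mean-squared-error consistency measure, and the decay value are assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.999):
    """Mean teacher: teacher weights are an exponential moving average of student weights.
    The decay value is an illustrative assumption."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def mix(a, b, lam):
    """Convex combination of two equal-length vectors (mixup-style interpolation)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mse(p, q):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) / len(p)


def mt_loss(student_pred, teacher_pred):
    """MT consistency: the student should agree with the teacher on the same input."""
    return mse(student_pred, teacher_pred)


def ict_loss(student_fn, teacher_fn, x1, x2, lam):
    """ICT consistency: the student's prediction on a mixed input should match
    the same mix of the teacher's predictions on the two original inputs."""
    target = mix(teacher_fn(x1), teacher_fn(x2), lam)
    return mse(student_fn(mix(x1, x2, lam)), target)


if __name__ == "__main__":
    # Toy "models": hypothetical stand-ins for the student and teacher networks.
    student = lambda x: [2.0 * v for v in x]
    teacher = lambda x: [1.5 * v for v in x]
    x1, x2 = [0.1, 0.4], [0.7, 0.2]
    print("MT loss:", mt_loss(student(x1), teacher(x1)))
    print("ICT loss:", ict_loss(student, teacher, x1, x2, lam=0.3))
```

Neither loss needs labels, which is why both can be weighted heavily on in-domain unlabelled clips. In a real system the teacher would be updated with `ema_update` after each student optimization step.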

Paper Structure (12 sections, 2 equations, 3 figures, 4 tables)

This paper contains 12 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The architecture of the baseline and proposed SED systems. The dashed blocks stand for the non-parametric modules and the solid blocks stand for the parametric modules. The feature dimensions before the merge layer are annotated, where $\text{T}_{\text{CNN}}$ and $\text{T}_{\text{SelfSL}}$ denote the output sequence lengths of the two modules.
  • Figure 2: t-SNE [Hinton2008tsne] visualization of the frame-level features generated by BEATs [Chen2023BEATs] and ATST-Frame [li2023self]. For both models, we randomly sample one frame-level representation from all the real audio clips in the DESED dataset.
  • Figure 3: The flowchart of the proposed fine-tuning strategy. The gray block of ATST-Frame means it is frozen.
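
The two-stage strategy that Figure 3 depicts can be sketched as a small schedule that switches which modules are trainable. This is an illustrative outline only: the module names other than ATST-Frame, the `StageConfig` fields, and the unsupervised-loss weights are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    trainable: bool = True


@dataclass
class StageConfig:
    name: str
    freeze_atst: bool    # True: ATST-Frame acts as a fixed feature extractor
    unsup_weight: float  # weight on the MT/ICT losses (placeholder value)


def apply_stage(modules, cfg):
    """Set trainable flags for one stage and return the names of trainable modules."""
    for m in modules.values():
        m.trainable = True
    modules["atst_frame"].trainable = not cfg.freeze_atst
    return [m.name for m in modules.values() if m.trainable]


# Stage 1 trains the rest of the system on top of a frozen ATST-Frame;
# stage 2 unfreezes it and weights the unsupervised losses more heavily.
STAGES = [
    StageConfig("stage1", freeze_atst=True, unsup_weight=1.0),    # assumed weight
    StageConfig("stage2", freeze_atst=False, unsup_weight=10.0),  # assumed weight
]

# Hypothetical module names for the CRNN-based system.
modules = {n: Module(n) for n in ("atst_frame", "cnn", "rnn", "classifier")}

if __name__ == "__main__":
    for cfg in STAGES:
        print(cfg.name, "trains", apply_stage(modules, cfg))
```

Starting the second stage from the first-stage weights means the randomly initialized modules are already trained before gradients reach the pretrained model, which matches the role the ablations assign to first-stage initialization.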