Multitask frame-level learning for few-shot sound event detection

Liang Zou; Genwei Yan; Ruoyu Wang; Jun Du; Meng Lei; Tian Gao; Xin Fang

TL;DR

Few-shot sound event detection (SED) struggles to detect short-duration events in noisy environments when it relies on segment-level predictions. The authors propose a multitask frame-level learning framework that combines a Sound Foreground-Background Classification (SFBC) branch with a Transformer-based embedding, paired with TimeFilterAug to simulate noisy conditions. Through multitask pretraining and a fine-tuning procedure with support reconstruction, the approach yields robust frame-level predictions, achieving a 63.8% F-score on the DCASE 2023 Task 5 evaluation. Ablations show gains from both the Transformer encoder and TimeFilterAug. The results suggest practical potential for efficient, robust frame-level SED in diverse acoustic settings.
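To make "frame-level prediction" concrete, here is a minimal sketch of how per-frame event probabilities could be decoded into onset/offset pairs. It is an illustration only, not the authors' post-processing: the function name `frames_to_events`, the 0.5 threshold, and the 20 ms hop are assumptions that do not come from the source.

```python
from typing import List, Sequence, Tuple


def frames_to_events(
    probs: Sequence[float],
    threshold: float = 0.5,     # assumed decision threshold, not from the paper
    hop_seconds: float = 0.02,  # assumed frame hop, not from the paper
) -> List[Tuple[float, float]]:
    """Turn per-frame probabilities into (onset, offset) pairs in seconds."""
    events: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # event begins at this frame
        elif not active and start is not None:
            # event ended on the previous frame; offset is the start of frame i
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # event still open at the end of the clip
        events.append((start * hop_seconds, len(probs) * hop_seconds))
    return events


if __name__ == "__main__":
    # One low-scoring frame (index 3) splits what may be a single event in two.
    print(frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7, 0.1], hop_seconds=1.0))
```

In this toy input, a single frame falling below the threshold cuts one stretch of activity into two short events. That is one way to picture the noise-induced truncation that frame-level systems are said to suffer from.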

Abstract

This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events with limited samples. However, prevailing methods in few-shot SED predominantly rely on segment-level predictions, which often struggle to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome these limitations, they commonly suffer from prediction truncation caused by background noise. To alleviate this issue, we introduce an innovative multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, securing 1st rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2023.
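The abstract describes TimeFilterAug only as "a linear timing mask", so the sketch below is one plausible reading rather than the authors' implementation. It assumes the mask is a gain curve over time, linearly interpolated between random gains at random frame boundaries and added to a log-magnitude spectrogram. The function name `linear_time_mask`, the number of bands, and the ±6 dB range are all hypothetical.

```python
import random
from typing import List, Optional, Tuple


def linear_time_mask(
    spec: List[List[float]],                      # frames x frequency bins, in dB
    n_bands: int = 3,                             # assumed number of time bands
    db_range: Tuple[float, float] = (-6.0, 6.0),  # assumed gain range in dB
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Add a piecewise-linear, time-varying gain to a log spectrogram."""
    rng = rng or random.Random()
    n_frames = len(spec)
    if n_frames == 0:
        return []
    # Pick random interior frame indices as band boundaries.
    interior = range(1, n_frames - 1)
    n_inner = min(max(n_bands - 1, 0), len(interior))
    bounds = [0] + sorted(rng.sample(interior, n_inner)) + [n_frames - 1]
    # Draw one random gain per boundary.
    gains = [rng.uniform(*db_range) for _ in bounds]
    # Linearly interpolate the gain between consecutive boundaries.
    offsets = [0.0] * n_frames
    for k in range(len(bounds) - 1):
        t0, t1 = bounds[k], bounds[k + 1]
        g0, g1 = gains[k], gains[k + 1]
        span = t1 - t0
        for t in range(t0, t1 + 1):
            w = (t - t0) / span if span else 0.0
            offsets[t] = g0 + w * (g1 - g0)
    # Apply the same gain to every frequency bin of a frame (new list returned).
    return [[v + offsets[t] for v in frame] for t, frame in enumerate(spec)]


if __name__ == "__main__":
    clean = [[0.0, 0.0] for _ in range(10)]
    augmented = linear_time_mask(clean, rng=random.Random(0))
    print([round(frame[0], 2) for frame in augmented])
```

Because the gain changes smoothly along time rather than along frequency, an augmentation of this kind would mimic slowly varying loudness or background conditions within a clip, which fits the stated goal of robustness to diverse acoustic environments.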

Paper Structure (12 sections, 6 equations, 3 figures, 3 tables)

This paper contains 12 sections, 6 equations, 3 figures, 3 tables.

Introduction
DATASET AND METHODOLOGY
Dataset
Multitask training framework
Multitask fine-tuning framework
TimeFilterAug
EXPERIMENT
Experimental setups
Experimental results
Modification on frame-level system
Ablation study
CONCLUSIONS

Figures (3)

Figure 1: Multitask frame-level embedding learning training framework. M is 20, N is 2. $C_n$ denotes the sequentially selected target class.
Figure 2: The feature interaction in the Transformer Encoder. The repetitive $POS$ embeddings and SED embeddings are in the dim1 and dim2 channel, respectively.
Figure 3: Multitask frame-level learning fine-tuning framework. The $POS$ Anchor in (b) denotes the frames belonging to the sound event. M is 20, N is 2.
