w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training

Orlem Lima dos Santos; Karen Rosero; Roberto de Alencar Lotufo

TL;DR

This work addresses sound event localization and detection (SELD) with limited labeled spatial audio through w2v-SELD, a self-supervised pre-training approach inspired by wav2vec 2.0. The model is adapted to accept multichannel input, pre-trained on unlabeled spatial audio, and fine-tuned with two prediction schemes (frame-based and segment-based) for sound event detection (SED) and direction-of-arrival (DOA) estimation, with data augmentation to improve generalization. The approach yields substantial gains over baselines and rivals state-of-the-art supervised methods, especially when pre-trained on unlabeled spatial data, and it shows improvements even with relatively modest amounts of pre-training data. These results indicate that self-supervised learning (SSL) is practical for 3D audio tasks: it reduces labeling requirements and supports robust SELD in real-world settings.

Abstract

Sound Event Localization and Detection (SELD) is a complex task that depends on extensive multichannel audio recordings annotated with sound events and their respective locations. In this paper, we introduce a self-supervised approach for SELD adapted from the pre-training methodology of wav2vec 2.0, which learns representations directly from raw audio without requiring labels. Applying this approach to SELD lets us leverage a substantial amount of unlabeled 3D audio data to learn robust representations of sound events and their locations. Our method comprises two stages: pre-training and fine-tuning. In the pre-training stage, unlabeled 3D audio datasets are used to train our w2v-SELD model, which captures high-level features and contextual information in the audio signals. In the fine-tuning stage, a smaller labeled SELD dataset is used to fine-tune the pre-trained model. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed self-supervised approach for SELD. The model surpasses the baseline systems provided with the datasets and achieves performance competitive with state-of-the-art supervised methods. The code and pre-trained parameters of our w2v-SELD model are available in this repository.

Paper Structure (30 sections, 3 equations, 5 figures, 5 tables)

Introduction
Related Works
Theoretical Background
wav2vec 2.0 model
Pre-training Objective
Fine-tuning wav2vec 2.0
Methodology
Model adaptation
Pre-training
Fine-tuning
Segment-based Prediction (w2v-SELD-SegPred)
Frame-based Prediction (w2v-SELD-FramePred)
Data Augmentation for Spatial Audio
Traditional techniques
Time and frequency masking
...and 15 more sections

Figures (5)

Figure 1: Illustration of the SELDnet model. Adapted from Adavanne et al. (2018).
Figure 2: Illustration of the wav2vec 2.0 model. Adapted from Baevski et al. (2020).
Figure 3: Illustration of the w2v-SELD-SegPred approach. $N$ denotes the number of SED classes, $T$ represents the number of time-steps, and $C$ denotes the embedding dimension of the w2v-SELD model.
Figure 4: Illustration of the w2v-SELD-FramePred. $N$ denotes the number of SED classes, $T$ represents the number of time-steps, and $C$ denotes the embedding dimension of the w2v-SELD model.
Figure 5: SpecAugment on w2v-SELD encoder output. (a) Raw audio signal. (b) Feature encoder output. (c) Feature encoder output with SpecAugment.
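
The masking illustrated in Figure 5 can be sketched roughly as follows. This is a minimal, illustrative example and not the authors' implementation: it assumes the feature encoder output is a $T \times C$ matrix (time-steps by embedding dimension, as in Figures 3 and 4), and the function name `mask_encoder_output`, the span widths, and the choice to fill masked cells with zeros are all assumptions made for this sketch.

```python
import random

def mask_encoder_output(features, time_width=2, channel_width=2,
                        n_time=1, n_channel=1, seed=None):
    """SpecAugment-style masking on a T x C feature matrix (list of lists).

    Hypothetical helper: zeroes n_time spans of consecutive time-steps and
    n_channel spans of consecutive embedding channels. Zero-filling is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    T = len(features)
    C = len(features[0]) if T else 0
    out = [row[:] for row in features]  # copy so the input stays untouched

    # Time masking: blank out whole rows (time-steps).
    for _ in range(n_time):
        w = min(time_width, T)
        start = rng.randint(0, T - w)
        for t in range(start, start + w):
            out[t] = [0.0] * C

    # Channel masking: blank out whole columns (embedding dimensions).
    for _ in range(n_channel):
        w = min(channel_width, C)
        start = rng.randint(0, C - w)
        for row in out:
            for c in range(start, start + w):
                row[c] = 0.0
    return out

# Example: 10 time-steps, 8 embedding channels, all ones.
feats = [[1.0] * 8 for _ in range(10)]
masked = mask_encoder_output(feats, seed=0)
```

With one time span and one channel span of width 2 each, two full rows and two full columns of the matrix are masked, while the input `feats` is left unchanged.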
