Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection

Jinbo Hu; Yin Cao; Ming Wu; Qiuqiang Kong; Feiran Yang; Mark D. Plumbley; Jun Yang

TL;DR

The paper addresses robust sound event localization and detection (SELD) across diverse acoustic environments by combining environment-independent pretraining, model-agnostic meta-learning (MAML), and a selective memory mechanism guided by environment representations. The proposed approach, environment-adaptive Meta-SELD, adapts quickly to unseen environments from limited data and selectively attenuates irrelevant initial parameters to mitigate cross-environment conflicts. In experiments on STARSS23 and computationally synthesized scenes, the method outperforms conventional supervised learning, particularly in localization, and yields environment representations that cluster by room and acoustic properties. The result is less adaptation time and less annotated data needed to deploy a SELD system in a new environment.

Abstract

Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, obtaining annotated samples for spatial sound events is notably costly. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization.
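The adaptation step described in the abstract (attenuate the meta-learned initial parameters according to an environment representation, then take a few gradient steps on a small support set) can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the sigmoid gating, the single-matrix attenuation "network" `phi`, and the randomly drawn `env_repr` are all assumptions made for illustration, and the meta-training outer loop is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss_and_grad(theta, x, y):
    """Mean squared error of the linear model x @ theta, and its gradient."""
    residual = x @ theta - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad

def attenuate(theta, env_repr, phi):
    """Selective memory (assumed form): scale each initial parameter by a
    gate in (0, 1) computed from the environment representation."""
    gates = sigmoid(phi @ env_repr)
    return gates * theta

def fast_adapt(theta, env_repr, phi, x_support, y_support,
               inner_lr=0.05, steps=5):
    """Attenuate the initialization, then run a few inner-loop gradient steps."""
    theta_i = attenuate(theta, env_repr, phi)
    for _ in range(steps):
        _, grad = mse_loss_and_grad(theta_i, x_support, y_support)
        theta_i = theta_i - inner_lr * grad
    return theta_i

rng = np.random.default_rng(0)
dim, env_dim = 4, 3
theta = rng.normal(size=dim)            # stand-in for the meta-learned initialization
phi = rng.normal(size=(dim, env_dim))   # stand-in for the attenuation network weights
env_repr = rng.normal(size=env_dim)     # stand-in for an environment representation

# A small "support set" from one unseen toy environment.
w_true = rng.normal(size=dim)
x_support = rng.normal(size=(20, dim))
y_support = x_support @ w_true

theta_bar = attenuate(theta, env_repr, phi)
loss_before, _ = mse_loss_and_grad(theta_bar, x_support, y_support)
theta_adapted = fast_adapt(theta, env_repr, phi, x_support, y_support)
loss_after, _ = mse_loss_and_grad(theta_adapted, x_support, y_support)
print(f"support loss before adaptation: {loss_before:.4f}")
print(f"support loss after adaptation:  {loss_after:.4f}")
```

In this sketch the gates only shrink parameters toward zero, which mirrors the idea of attenuating parts of the initialization that conflict with the target environment before adapting. In the paper the gates come from a learned attenuation network fed with environment representations extracted from the backbone, not from random stand-ins.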

Paper Structure (33 sections, 12 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
Learning-based SELD methods
Environment shifts and conflicts
Data acquisition
Meta learning
Our contributions
Fast adaptation to the environment
Pre-trained environment-independent models
Data synthesis
Network architecture
Meta-SELD
Meta-SELD++
Environment-Adaptive Meta-SELD
Selective memory
Environment representations
...and 18 more sections

Figures (11)

Figure 1: Room-wise metric scores of our previous system [hu22dw], submitted to Task 3 of the DCASE 2022 Challenge, on the STARSS22 validation set. Each metric is described in the paper's metrics section. The system ranked second in the team ranking.
Figure 2: A diagram of the meta-training procedure for our proposed environment-adaptive Meta-SELD. For simplicity, we only consider one gradient update for the inner loop of the training procedure. $\mathcal{N}$ indicates the number of tasks in the meta-training set. $f_\Theta$, $g_\Omega$, and $h_\Phi$ represent the backbone, environment extractor, and attenuation network, respectively. $\Theta^l$, where $l=1\dots p$, denotes the $l$-th layer of the total $p$-layer backbone $f_\Theta$.
Figure 3: An illustration of Meta-SELD++ with and without selective memory. Selective memory is proposed to tackle environment conflicts. It adds an additional step for attenuation of initial parameters before fast adaptation to the task $i$. The selective memory method provides a better solution (closer to the optimal solution).
Figure 4: The network architecture of the SELD network with the sub-network of environment representation extraction. The environment representations are extracted from output feature maps of each layer of the backbone.
Figure 5: The data division for the meta-training and the meta-test sets according to recording environments.
...and 6 more figures
