Zero- and Few-shot Sound Event Localization and Detection

Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

TL;DR

The paper tackles zero- and few-shot SELD, in which target classes are specified after training by a text sample or a few audio samples rather than being fixed in advance. It introduces the embed-ACCDOA model, a two-branch network that outputs track-wise CLAP embeddings $\mathbf{E}$ and ACCDOA vectors $\mathbf{P}$. For track $n$ and frame $t$, $\mathbf{P}_{nt} = a_{nt}\mathbf{R}_{nt}$, where the activity is $a_{nt}=\|\mathbf{P}_{nt}\|$ and the DOA is the unit vector $\mathbf{R}_{nt}=\mathbf{P}_{nt}/\|\mathbf{P}_{nt}\|$. The model is trained with permutation-invariant training so that its track-wise outputs match oracle targets, including the oracle embeddings $\mathbf{E}^{*}$. At inference, support embeddings computed from the zero- or few-shot samples are compared with each track's embedding by cosine similarity to assign a class to that track; in single-source scenarios, the prediction can optionally be combined with the CLAP audio encoder. Experiments on STARSS23 and TNSSE21 show that the embed-ACCDOA model achieves better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model, and that combining the CLAP audio encoder with the embed-ACCDOA model gives the best SELD error among the zero- and few-shot methods, approaching an official baseline trained on the complete training data. This enables flexible SELD deployment for user-defined class sets without retraining, with potential impact on surveillance, smart devices, and biodiversity monitoring.
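
The inference step summarized above can be illustrated with a minimal NumPy sketch for a single frame. It is not the authors' implementation: the function name `decode_tracks`, the activity threshold of 0.5, the toy two-dimensional embeddings, and the use of a plain argmax over cosine similarities are all assumptions made for illustration.

```python
import numpy as np

def decode_tracks(P, E, support, threshold=0.5):
    """Decode one frame of track-wise outputs (illustrative sketch).

    P:       (N, 3) ACCDOA vectors, one per track.
    E:       (N, D) embeddings predicted for each track.
    support: (C, D) support embeddings, one per user-defined class.
    threshold: assumed activity threshold (not taken from the paper).
    """
    eps = 1e-8
    # Activity is the vector length: a_n = ||P_n||.
    activity = np.linalg.norm(P, axis=-1)
    # DOA is the unit direction: R_n = P_n / ||P_n||.
    doa = P / np.maximum(activity[:, None], eps)
    # Cosine similarity between each track embedding and each support embedding.
    E_unit = E / np.maximum(np.linalg.norm(E, axis=-1, keepdims=True), eps)
    S_unit = support / np.maximum(np.linalg.norm(support, axis=-1, keepdims=True), eps)
    sim = E_unit @ S_unit.T  # shape (N, C)
    # Assign each track to its most similar class (assumed decision rule).
    cls = sim.argmax(axis=-1)
    active = activity > threshold
    return active, doa, cls, sim

# Toy frame: track 0 is active and points along +x; track 1 is nearly silent.
P = np.array([[0.9, 0.0, 0.0], [0.0, 0.1, 0.0]])
E = np.array([[1.0, 0.1], [0.0, 1.0]])
support = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical class embeddings
active, doa, cls, sim = decode_tracks(P, E, support)
```

In a real system the support embeddings would come from the CLAP text encoder (zero-shot) or from the CLAP audio encoder applied to a few audio samples (few-shot); here they are hand-written vectors so the sketch stays self-contained.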

Abstract

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete train data in an evaluation dataset.
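
Because the model's tracks have no fixed order, training a track-wise model needs a way to decide which output track is compared with which target. The TL;DR states that permutation-invariant training is used for this. The sketch below shows the general idea for one frame, under assumptions: the mean-squared-error terms for both the ACCDOA vectors and the embeddings, their equal weighting, and the name `pit_loss` are illustrative choices, not the loss defined in the paper.

```python
from itertools import permutations

import numpy as np

def pit_loss(P_pred, E_pred, P_true, E_true):
    """Permutation-invariant loss for one frame (illustrative sketch).

    P_pred, P_true: (N, 3) predicted and target ACCDOA vectors.
    E_pred, E_true: (N, D) predicted and target embeddings.
    Returns the smallest loss over all track orderings and that ordering.
    """
    n_tracks = P_pred.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(n_tracks)):
        idx = list(perm)
        # Assumed loss: MSE on ACCDOA vectors plus MSE on embeddings.
        loss = np.mean((P_pred - P_true[idx]) ** 2) + np.mean((E_pred - E_true[idx]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy targets for two overlapping events, and predictions in swapped track order.
P_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
E_true = np.array([[1.0, 0.0], [0.0, 1.0]])
P_pred = P_true[[1, 0]]
E_pred = E_true[[1, 0]]
best_loss, best_perm = pit_loss(P_pred, E_pred, P_true, E_true)
```

The exhaustive search over orderings is practical only for a small number of tracks, which matches the 2-track configuration shown in Figure 3.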

Paper Structure (10 sections, 11 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Overview of zero- and few-shot SELD system.
  • Figure 2: Zero- and few-shot sound classification and SELD tasks.
  • Figure 3: Overview of a 2-track embed-ACCDOA model.
  • Figure 4: SELD performance of the combination of the CLAP audio encoder and embed-ACCDOA model for STARSS23 with different numbers of shots.