Learning Multi-Target TDOA Features for Sound Event Localization and Detection

Axel Berg, Johanna Engman, Jens Gulin, Karl Åström, Magnus Oskarsson

TL;DR

Permutation invariant training on the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. Used as input to a SELD network, these features improve localization performance compared to standard GCC-PHAT or SALSA-Lite input features.

Abstract

Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
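
For readers unfamiliar with the baseline feature, the following is a minimal sketch of classical GCC-PHAT for a single microphone pair, written with NumPy. It is not the learned NGCC-PHAT proposed in the paper; the function name `gcc_phat`, the parameter `max_delay`, and the small regularization constant are illustrative assumptions.

```python
import numpy as np

def gcc_phat(x_i, x_j, max_delay):
    """Classical GCC-PHAT for integer delays in [-max_delay, max_delay].

    Assumes max_delay >= 1. A positive peak location means x_i lags x_j.
    """
    n = len(x_i) + len(x_j)            # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)         # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the first entry corresponds to delay -max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    delays = np.arange(-max_delay, max_delay + 1)
    return delays, cc

# Usage: a white-noise source reaching microphone i five samples later.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
x_j = s
x_i = np.concatenate((np.zeros(5), s[:-5]))
delays, cc = gcc_phat(x_i, x_j, max_delay=16)
estimated_tdoa = delays[np.argmax(cc)]  # TDOA estimate in samples
```

With a single dominant source the correlation has one clear peak. With several overlapping events, the peaks compete within a single correlation function, which is the situation the multi-target features in the paper are meant to address.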

Paper Structure

This paper contains 9 sections, 6 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of our pre-training strategy with $K=3$ tracks. Given a set of sound events, we train a neural GCC-PHAT to predict the TDOA of each event. When the number of sound events is less than $K$, auxiliary duplication of the labels is used. In this illustration, only two microphones are shown for brevity.
  • Figure 2: Illustration of how TDOA features are used together with log mel-spectrograms as input to the CST-Former network.
  • Figure 3: An example of the TDOA predictions $p_k(\tau | \mathbf{x}_i, \mathbf{x}_j )$ from the pre-trained NGCC-PHAT network using $K=3$ output tracks. Predictions are shown for all six microphone combinations $(i,j)$ at a single time frame with two events and ground truth TDOAs $\tau_{ij}^1$ and $\tau_{ij}^2$.
  • Figure 4: Micro-averaged F-score as a function of the angular threshold $T_{DOA}$ for different numbers of output tracks $K$ used during TDOA pre-training. Evaluation was done using CST-Former Small.
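
The pre-training idea in the Figure 1 caption (K output tracks, permutation invariance, and label duplication when fewer than K events are active) can be sketched as follows. This is one plausible reading, not the paper's implementation: it assumes each track outputs log-probabilities over discrete TDOA bins, uses cross-entropy as the per-track loss, and assumes at least one active event. The name `pit_tdoa_loss` and the averaging over tracks are hypothetical choices.

```python
import itertools
import numpy as np

def pit_tdoa_loss(log_probs, target_bins):
    """Permutation invariant cross-entropy with label duplication.

    log_probs:   array of shape (K, n_bins), log p_k(tau) for each track k.
    target_bins: list of M ground-truth TDOA bin indices, with 1 <= M <= K.
    """
    K = log_probs.shape[0]
    M = len(target_bins)
    if not 1 <= M <= K:
        raise ValueError("need between 1 and K active events")
    best = np.inf
    # Try every assignment of events to tracks. When M < K some events are
    # duplicated across tracks; when M == K this reduces to plain permutations.
    for assignment in itertools.product(range(M), repeat=K):
        if len(set(assignment)) < M:
            continue  # every event must be covered by at least one track
        loss = -sum(log_probs[k, target_bins[m]]
                    for k, m in enumerate(assignment)) / K
        best = min(best, loss)
    return best

# Usage: K=3 tracks, 5 TDOA bins, two active events at bins 3 and 1.
probs = np.full((3, 5), 0.025)
probs[0, 1] = 0.9
probs[1, 3] = 0.9
probs[2, 3] = 0.9
loss = pit_tdoa_loss(np.log(probs), [3, 1])
```

Taking the minimum over assignments means the network is not penalized for the order in which its tracks report events, and the duplication ensures that every track always receives a training target.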