Reverberation-based Features for Sound Event Localization and Detection with Distance Estimation

Davide Berghi, Philip J. B. Jackson

TL;DR

We address distance estimation in 3D SELD by introducing reverberation-based input features that capture distance cues from early reflections. Two feature families are proposed: DRR-based representations computed from direct and reverberant energy, and an autocorrelation-based measure (stpACC) of early floor reflections; both are designed to be concatenated with standard SELD inputs. Pre-training on synthetic data and applying data augmentation improve distance accuracy and overall SELD performance, with autocorrelation-based features yielding the largest gains on STARSS23, reducing the relative distance error (RDE) and improving the overall SELD score. The approach demonstrates that incorporating reverberation cues into input features can enhance 3D SELD when distance labels are available, offering a path toward more accurate spatial audio understanding in real rooms.

Abstract

Sound event localization and detection (SELD) involves predicting active sound event classes over time while estimating their positions. The localization subtask in SELD is usually treated as a direction of arrival estimation problem, ignoring source distance. Only recently, SELD was extended to 3D by incorporating distance estimation, enabling the prediction of sound event positions in 3D space (3D SELD). However, existing methods lack input features designed for distance estimation. We argue that reverberation encodes valuable information for this task. This paper introduces two novel feature formats for 3D SELD based on reverberation: one using direct-to-reverberant ratio (DRR) and another leveraging signal autocorrelation to provide the model with insights into early reflections. Pre-training on synthetic data improves relative distance error (RDE) and overall SELD score, with autocorrelation-based features reducing RDE by over 3 percentage points on the STARSS23 dataset. The code to extract the features is available at github.com/dberghi/SELD-distance-features.
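To make the link between DRR and distance concrete, the following is a minimal sketch that applies the textbook DRR definition (energy around the direct-sound peak divided by the energy of everything after it) to a toy room impulse response. It is not the feature extractor from the paper's repository: the helper names `toy_rir` and `drr_db`, the 2.5 ms direct-sound window, the 24 kHz sampling rate, the 0.5 s RT60 and the tail level are all assumptions chosen for illustration, and the toy model simply scales the direct sound by 1/distance while keeping the reverberant tail fixed.

```python
import numpy as np


def toy_rir(distance, fs=24000, rt60=0.5, seed=0):
    """Toy RIR (illustrative assumption): direct impulse scaled by 1/distance
    plus an exponentially decaying noise tail that does not depend on distance."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    # Amplitude envelope that drops by 60 dB at t = rt60.
    tail = 0.01 * rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    tail[: int(0.005 * fs)] = 0.0  # keep the first 5 ms free of reverberation
    rir = tail
    rir[0] += 1.0 / distance  # direct sound with spherical spreading loss
    return rir


def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB, using a window around the main peak."""
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    start = max(peak - half, 0)
    stop = min(peak + half + 1, len(rir))
    direct_energy = np.sum(rir[start:stop] ** 2)
    reverb_energy = np.sum(rir[stop:] ** 2)
    return 10.0 * np.log10(direct_energy / reverb_energy)


if __name__ == "__main__":
    fs = 24000
    for d in (1.0, 2.0, 4.0, 8.0):
        print(f"distance {d:.0f} m -> DRR {drr_db(toy_rir(d, fs=fs), fs):.2f} dB")
```

In this toy model the DRR falls by about 6 dB for each doubling of distance, which is the kind of monotonic cue that makes DRR attractive as a distance feature. In real recordings the RIR is not available, so the direct and reverberant energies have to be estimated from the signal itself; that estimation step is not covered by this sketch.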

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Floor reflection path when source and receiver are at the same height ($h_s{=}h_m$) and separated by distance $d$.
  • Figure 2: RIRs from the omnidirectional FOA channel of the SurrRoom 1.0 dataset (Cieciura et al., 2023; "Pop_Recording_Studio" room), used to spatialize speech at different distances. Direct sound peaks are temporally aligned for comparison.
  • Figure 3: Autocorrelation coefficient at varying distances (top). Short-term power of the autocorrelation (bottom).
  • Figure 4: Distance features and the corresponding log-mel spectrogram extracted from a STARSS23 sequence.
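
The geometry in Figure 1 and the autocorrelation behaviour in Figure 3 can be illustrated with a short sketch. With source and microphone both at height $h$ and separated by distance $d$, the image-source construction gives a floor-reflected path of length $\sqrt{d^2 + 4h^2}$, so the reflection arrives $(\sqrt{d^2 + 4h^2} - d)/c$ seconds after the direct sound, and this delay shrinks as $d$ grows. The code below computes that delay and checks that a white-noise signal plus one delayed copy produces an autocorrelation peak at the matching lag. The function names, the speed of sound (343 m/s), the 1.5 m height, the 24 kHz sampling rate and the reflection gain are assumptions for illustration; this is not the paper's stpACC computation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def floor_reflection_delay(d, h, c=SPEED_OF_SOUND):
    """Delay (s) of the floor reflection relative to the direct path, for a
    source and microphone at the same height h, separated by distance d."""
    reflected_path = np.sqrt(d ** 2 + (2.0 * h) ** 2)  # image-source path length
    return (reflected_path - d) / c


def autocorr_coeff(x, max_lag):
    """Normalized autocorrelation for lags 0..max_lag, computed via the FFT."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)
    nfft = 1 << (2 * len(x) - 1).bit_length()  # zero-pad to avoid circular wrap
    spec = np.fft.rfft(x, nfft)
    acf = np.fft.irfft(spec * np.conj(spec), nfft)[: max_lag + 1]
    return acf / acf[0]


def expected_and_measured_lag(d, h=1.5, fs=24000, gain=0.6, seed=0):
    """Mix 1 s of white noise with its floor-reflected copy and return the
    expected lag (samples) and the lag of the largest autocorrelation peak."""
    lag = int(round(floor_reflection_delay(d, h) * fs))
    if lag < 1:
        raise ValueError("delay is shorter than one sample at this distance")
    rng = np.random.default_rng(seed)
    direct = rng.standard_normal(fs)
    mixture = direct.copy()
    mixture[lag:] += gain * direct[:-lag]  # single delayed, attenuated copy
    acf = autocorr_coeff(mixture, max_lag=int(0.02 * fs))
    measured = int(np.argmax(acf[1:])) + 1  # skip the trivial lag-0 peak
    return lag, measured


if __name__ == "__main__":
    for d in (1.0, 2.0, 4.0):
        delay_ms = 1e3 * floor_reflection_delay(d, 1.5)
        print(d, round(delay_ms, 3), expected_and_measured_lag(d))
```

A real recording contains many reflections and a non-white source, so the peak is far less clean than in this sketch, but the example shows why the position of early autocorrelation peaks carries distance information.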