Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction

Idan Cohen, Ofir Lindenbaum, Sharon Gannot

TL;DR

An unsupervised, data-driven approach to acoustic scene mapping that exploits the natural structure of the data by adapting the recently proposed local conformal autoencoder (LOCA), an offline deep learning scheme for extracting standardized data coordinates from measurements.

Abstract

Classical methods for acoustic scene mapping require the estimation of the time difference of arrival (TDOA) between microphones. Unfortunately, TDOA estimation is very sensitive to reverberation and additive noise. We introduce an unsupervised data-driven approach that exploits the natural structure of the data. Our method builds upon local conformal autoencoders (LOCA), an offline deep learning scheme for learning standardized data coordinates from measurements. Our experimental setup includes a microphone array that measures the transmitted sound source at multiple locations across the acoustic enclosure. We demonstrate that LOCA learns a representation that is isometric to the spatial locations of the microphones. The performance of our method is evaluated using a series of realistic simulations and compared with other dimensionality-reduction schemes. We further assess the influence of reverberation on the results of LOCA and show that it demonstrates considerable robustness.

Paper Structure

This paper contains 7 sections, 5 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: An illustration of LOCA adapted to acoustic scene mapping. The observation space $(\mathcal{Y})$ is assumed to model a nonlinear deformation of the inaccessible manifold $(\mathcal{X})$. We attempt to invert the unknown measurement function, utilizing the burst sampling strategy. LOCA consists of an encoder ($E$) parameterized by $\rho$ and a decoder ($D$) parameterized by $\gamma$. The autoencoder receives a set of points and corresponding neighborhoods; each neighborhood is depicted as a dark oval point cloud (at the top of the figure), which we implement using a microphone array. At the bottom, we zoom in on a single anchor point $y_i$ (green) and its corresponding neighborhood $Y_i$ (bounded by a blue ellipsoid). The encoder attempts to whiten each neighborhood in the embedding space, while the decoder aims at reconstructing the input. In practice, a dedicated microphone array constellation will be used to extract bursts of adjacent acoustic samples of RTF (see Figure 2).
  • Figure 2: Room sampling strategy visualization. Blue circles denote the location of the sound sources. The sampling grid along which the device travels is shown in grey. We zoom in to show the schematic configuration of the burst microphone array: seven cross-like components consisting of vertical and horizontal microphone pairs.
  • Figure 3: 2-D geometric reconstruction achieved by the embedding of our framework (LOCA). The coloring indicates the correlation with the true vertical coordinate of the original scene.
  • Figure 4: A visualization of the extrapolation capabilities of LOCA and A-DM for $\textrm{RT}_{60}=160$ ms. The samples from the extrapolated region are shown in red. The MAE values of the extrapolated region are $16.1$ cm for LOCA and $67.4$ cm for A-DM.
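
The Figure 1 caption describes two objectives: the encoder whitens each neighborhood (burst) in the embedding space, and the decoder reconstructs the input. The following is a minimal sketch of how such losses could be computed, not the authors' implementation. It assumes the whitening term is the squared Frobenius distance between each embedded burst's covariance, scaled by $1/\sigma^2$, and the identity matrix, and that reconstruction is a mean squared error. The names `whitening_loss`, `reconstruction_loss`, `cross_burst`, and the burst scale `sigma` are hypothetical and chosen only for illustration; the toy "encoders" are an identity map and a coordinate stretch rather than trained networks.

```python
import math


def covariance(points):
    """Sample covariance matrix of a list of equal-length coordinate tuples."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def whitening_loss(embedded_bursts, sigma):
    """Average squared Frobenius distance between Cov(burst) / sigma^2 and I.

    Assumed form of the whitening objective; sigma is the assumed burst scale.
    """
    total = 0.0
    for burst in embedded_bursts:
        cov = covariance(burst)
        d = len(cov)
        total += sum((cov[a][b] / sigma ** 2 - (1.0 if a == b else 0.0)) ** 2
                     for a in range(d) for b in range(d))
    return total / len(embedded_bursts)


def reconstruction_loss(inputs, reconstructions):
    """Mean squared Euclidean error between inputs and decoder outputs."""
    return sum(sum((u - v) ** 2 for u, v in zip(y, y_hat))
               for y, y_hat in zip(inputs, reconstructions)) / len(inputs)


def cross_burst(center, sigma):
    """Hypothetical 4-point cross burst whose sample covariance is sigma^2 * I."""
    a = sigma * math.sqrt(1.5)
    cx, cy = center
    return [(cx + a, cy), (cx - a, cy), (cx, cy + a), (cx, cy - a)]


sigma = 0.05  # assumed burst scale, in arbitrary length units
anchors = [(1.0, 1.0), (2.0, 3.5), (4.0, 0.5)]
bursts = [cross_burst(c, sigma) for c in anchors]

# Identity "encoder": the bursts are already white at scale sigma.
loss_isometric = whitening_loss(bursts, sigma)

# Toy encoder that stretches the first coordinate by 2, which breaks whitening.
stretched = [[(2.0 * x, y) for x, y in burst] for burst in bursts]
loss_stretched = whitening_loss(stretched, sigma)

print(loss_isometric, loss_stretched)
```

In this toy setting the whitening term is near zero for the distance-preserving map and grows once one coordinate is stretched, which is the intuition behind using locally white bursts to encourage an embedding that is isometric to the microphone positions.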