Physics-Guided Variational Model for Unsupervised Sound Source Tracking

Luan Vinícius Fiorio; Ivana Nikoloska; Bruno Defraene; Alex Young; Johan David; Ronald M. Aarts

TL;DR

This paper introduces a physics-guided variational model for fully unsupervised single-source sound source tracking. The model combines a variational encoder with a physics-based decoder that injects geometric constraints into the latent space through analytically derived pairwise time-delay likelihoods.

Abstract

Sound source tracking is often performed using classical array-processing algorithms. Machine-learning alternatives typically rely on ground-truth position labels, which are costly to obtain. We propose a variational model that performs unsupervised single-source sound source tracking in latent space, aided by a physics-based decoder. Our experiments demonstrate that the proposed method surpasses traditional baselines and achieves performance and computational complexity comparable to state-of-the-art supervised models. We also show that the method is substantially robust to altered microphone array geometries and to corrupted microphone position metadata. Finally, we extend the method to multi-source sound tracking and propose the basic theoretical changes this requires.
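As a rough illustration of the geometric constraint a time-delay-based decoder can exploit, the sketch below computes the far-field delay between two microphones for a given source direction. It is not the paper's decoder: the plane-wave (far-field) model, the speed of sound of 343 m/s, the azimuth/elevation convention, and the function names `unit_direction` and `pairwise_delay` are all assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees Celsius


def unit_direction(azimuth_deg, elevation_deg):
    """Unit vector pointing from the array towards the source.

    Assumed convention: azimuth is measured in the x-y plane from the x axis,
    elevation is measured from the x-y plane towards the z axis.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def pairwise_delay(v_i, v_j, u, c=SPEED_OF_SOUND):
    """Far-field delay t_i - t_j in seconds for microphones at v_i and v_j.

    A plane wave arriving from direction u reaches a microphone at position v
    at time -(v . u) / c relative to the origin, so the microphone closer to
    the source receives the wavefront first.
    """
    diff_dot_u = sum((a - b) * d for a, b, d in zip(v_i, v_j, u))
    return -diff_dot_u / c


# Hypothetical pair of microphones 10 cm apart on the x axis.
v_i = (0.05, 0.0, 0.0)
v_j = (-0.05, 0.0, 0.0)
u = unit_direction(0.0, 0.0)  # source on the positive x axis
tau = pairwise_delay(v_i, v_j, u)
```

With these hypothetical positions, `tau` is negative with a magnitude of roughly 0.29 ms, because microphone `i` is closer to the source. A source at broadside (azimuth 90 degrees) gives a delay of essentially zero for the same pair.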

Paper Structure (21 sections, 28 equations, 3 figures, 6 tables)

Introduction
Sound source tracking
Classical
Supervised learning
Unsupervised learning
Proposed model
Features
Variational autoencoder
Physics-based decoder
Variational posterior
Reparameterization
Loss function
Architecture
Experiments
Data
...and 6 more sections

Figures (3)

Figure 1: The proposed physics-guided variational model. Notice that $(\mathbf{v}_i,\mathbf{v}_j)$ is treated as metadata instead of a random variable.
Figure 2: Encoder architecture. A conv. block takes input $\mathbf{x}$ and metadata $\mathbf{v}$ and consists of a 2D conv. layer with 128 output channels, kernel (3,3), and unit stride and padding. Its output is biased by a projection of the metadata through a linear layer. Single-group normalization is then applied, and its output passes through a PReLU and a 2D max pooling layer. The encoder is composed of three conv. blocks, with pooling sizes (2,2,2) along the lag axis and (5,1,1) along the time axis. They are followed by a two-layer unidirectional gated recurrent unit (GRU) and a two-layer pairwise MLP, both with output size 128 and PReLU activation. The pairwise MLP outputs are combined by summation and then processed by a final two-layer MLP. Its first layer has a PReLU activation and its second is purely linear, producing four outputs: three are the coordinates of $\boldsymbol{\mu}_\phi$, and the remaining one represents $\kappa_\phi$.
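To make the pooling schedule in the Figure 2 caption concrete, the following sketch traces the feature-map size through the three conv. blocks. It assumes that each max pooling layer uses a stride equal to its pooling size with no padding (so sizes are floor-divided), and the input size of 64 lag bins by 50 time frames is hypothetical; the function names are illustrative only.

```python
def conv_block_shape(lag, time, pool_lag, pool_time):
    """Spatial size after one conv. block.

    The (3,3) convolution with stride 1 and padding 1 preserves both axes;
    max pooling then floor-divides each axis by its pooling size (assumed
    stride equal to the pooling size, no padding).
    """
    kernel, stride, padding = 3, 1, 1
    conv_lag = (lag + 2 * padding - kernel) // stride + 1
    conv_time = (time + 2 * padding - kernel) // stride + 1
    return conv_lag // pool_lag, conv_time // pool_time


def encoder_feature_shape(lag, time, pools_lag=(2, 2, 2), pools_time=(5, 1, 1)):
    """Apply the three conv. blocks with the pooling sizes from the caption."""
    for p_lag, p_time in zip(pools_lag, pools_time):
        lag, time = conv_block_shape(lag, time, p_lag, p_time)
    return lag, time


# Hypothetical input: 64 lag bins and 50 time frames per microphone pair.
shape = encoder_feature_shape(64, 50)
```

Under these assumptions the lag axis shrinks by a factor of 8 overall and the time axis by a factor of 5, so the hypothetical 64 by 50 input becomes an 8 by 10 map (with 128 channels) before the GRU.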
Figure 3: DOA estimation example of the proposed variational model for real-world (LOCATA) data. The time-domain signal captured by one of the microphones is displayed in gray, while blue and orange curves represent, respectively, the azimuth and elevation targets, with their dashed counterparts showing estimated angles.
