SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio

Erik Tegler, Magnus Oskarsson, Kalle Åström

TL;DR

This paper demonstrates that learning-based methods, even when trained on synthetic data, can significantly outperform GCC-PHAT on novel real-world data, and that using the resulting model greatly improves performance on the downstream task of self-calibration compared to classical methods.

Abstract

Time delay estimation, also called Time-Difference-Of-Arrival (TDOA) estimation, is a critical component of many localization applications such as multilateration, direction-of-arrival estimation, and self-calibration. The task is to estimate the difference in arrival time of a signal at two different sensors. For the audio sensor modality, most current systems are based on classical methods such as the Generalized Cross-Correlation Phase Transform (GCC-PHAT) method. In this paper we demonstrate that learning-based methods can, even when trained on synthetic data, significantly outperform GCC-PHAT on novel real-world data. To overcome the lack of data with ground truth for the task, we train our model on a simulated dataset that is sufficiently large and varied and that captures the relevant characteristics of the real-world problem. We provide our trained model, SONNET (Simulation Optimized Neural Network Estimator of Timeshifts), which runs in real time and works out of the box on novel data for many real-data applications, i.e. without re-training. We further demonstrate greatly improved performance on the downstream task of self-calibration when using our model compared to classical methods.
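
For readers unfamiliar with the classical baseline named in the abstract, the following is a minimal sketch of a textbook GCC-PHAT estimator, assuming NumPy is available. It is not the authors' baseline implementation: the function name `gcc_phat`, the parameters `fs` and `max_tau`, and the small regularization constant are illustrative choices. The idea is to whiten the cross-power spectrum so that only phase remains, then pick the lag at which the resulting correlation peaks.

```python
import numpy as np


def gcc_phat(x, y, fs, max_tau=None):
    """Estimate how much later y arrives than x, in seconds (textbook GCC-PHAT).

    x, y    : 1-D arrays holding the two microphone signals
    fs      : sampling rate in Hz
    max_tau : optional bound on the absolute delay in seconds (hypothetical parameter)
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)

    # Cross-power spectrum, then PHAT weighting: discard magnitude, keep phase.
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs
```

With this sign convention a positive result means the signal reaches the second microphone later than the first. For example, calling `gcc_phat(x, y, fs=16000)` on a white-noise signal `x` and a copy `y` delayed by 7 samples should return a value close to `7 / 16000` seconds.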

Paper Structure

This paper contains 17 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Since the microphones are at different distances from the speaker, the signal arrives at different times for each of them. By estimating the timeshift in the signals (right figure), and combining it with the propagation speed of the signal, we get a measurement of distance difference (left figure).
  • Figure 2: System overview: Our model takes two audio recordings of length $d$ as input data. The data is first converted to the frequency domain, using the fast Fourier transform, and stored with real and imaginary components as different channels. It is then sent through a series of 1d convolutional layers. The features are then processed using $M$ stacked pairs of linear layers along with skip connections. Finally, the logits are acquired by adding a linear layer after the last residual block.
  • Figure 3: Results on the simulated data. (a) Noise sensitivity evaluated at $T_{60}$ = 0.2 s. Note that GCC-PHAT is very robust against white noise. (b) Reverberation sensitivity evaluated at SNR = 10 dB.
  • Figure 4: Quantitative results on the dataset tdoa_20201016 showing the probability of correct detection at different inlier thresholds. We have marked the 10 cm threshold which we use as our main evaluation metric.
  • Figure 5: Qualitative results of the estimated TDOA values on the dataset tdoa_20201016. (a) and (c) correspond to the recording music_0014 while (b) and (d) correspond to chirp_0001. The microphone pair used for all four plots is microphone 1 and microphone 6. SONNET significantly outperforms GCC-PHAT when music is played while also achieving a performance gain when chirp sounds are played.
  • ...and 2 more figures
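
The captions of Figures 1 and 4 describe two small computations that can be sketched in a few lines of Python: turning an estimated timeshift into a distance difference, and scoring estimates by the fraction that fall within an inlier threshold of the ground truth. This is an illustrative sketch, not the authors' evaluation code. The speed of sound of 343 m/s, the function names, and the reading of "probability of correct detection" as a simple inlier fraction are all assumptions; only the 10 cm default threshold comes from the Figure 4 caption.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def timeshift_to_distance_difference(timeshift_s, speed=SPEED_OF_SOUND_M_PER_S):
    """Convert a timeshift in seconds into a distance difference in meters."""
    return timeshift_s * speed


def inlier_rate(estimated_m, ground_truth_m, threshold_m=0.10):
    """Fraction of estimates within threshold_m meters of the ground truth.

    The 0.10 m default mirrors the 10 cm threshold marked in Figure 4;
    treating the metric as a plain inlier fraction is an assumption.
    """
    if len(estimated_m) != len(ground_truth_m):
        raise ValueError("estimates and ground truth must have equal length")
    if not estimated_m:
        raise ValueError("at least one estimate is required")
    hits = sum(
        abs(est - true) <= threshold_m
        for est, true in zip(estimated_m, ground_truth_m)
    )
    return hits / len(estimated_m)
```

Under these assumptions, a timeshift of 1 ms corresponds to a distance difference of about 0.343 m, and sweeping `threshold_m` over a range of values would trace a curve of the kind shown in Figure 4.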