wav2pos: Sound Source Localization using Masked Autoencoders

Axel Berg, Jens Gulin, Mark O'Connor, Chuteng Zhou, Karl Åström, Magnus Oskarsson

TL;DR

wav2pos addresses 3D sound source localization with ad-hoc microphone arrays by formulating SSL as a multimodal set-to-set regression problem. It uses a masked autoencoder built on a transformer architecture to jointly process audio signals and microphone coordinates, allowing predictions for the sound source and missing microphones while handling variable array configurations. Key contributions include a pairwise positional encoding scheme, a time-delay feature module with NGCC-PHAT TDOA inputs, and a masking strategy that enables robust performance under missing data; the method achieves competitive or superior accuracy on real LuViRA data and in simulated environments, with notable de-noising benefits. The approach offers a flexible, scalable framework for indoor SSL that can be extended to multiple sources and self-calibration scenarios, potentially broadening practical deployment in dynamic, real-world settings.
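To make the time-delay input concrete, the following is a minimal sketch of the classical GCC-PHAT estimator that NGCC-PHAT builds on. It is not the learned NGCC-PHAT module used by wav2pos; the function name `gcc_phat_delay` and its parameters are illustrative assumptions, and the example estimates the delay between two microphone signals in samples.

```python
import numpy as np


def gcc_phat_delay(x, y, max_lag):
    """Estimate the delay of y relative to x, in samples, with classical GCC-PHAT.

    Hypothetical helper for illustration only; not the paper's NGCC-PHAT module.
    """
    n = len(x) + len(y)  # zero-pad so the correlation is linear, not circular
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)            # cross-power spectrum
    cross /= np.abs(cross) + 1e-12    # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)     # generalized cross-correlation
    # Reorder so the array covers lags -max_lag ... +max_lag in order
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag


# Usage: a white-noise signal delayed by 7 samples at the second microphone
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
true_delay = 7
y = np.concatenate((np.zeros(true_delay), x[:-true_delay]))
estimated = gcc_phat_delay(x, y, max_lag=32)
```

A positive result means the signal reaches the second microphone later than the first. Multiplying the delay by the speed of sound and dividing by the sample rate gives the range difference that a TDOA-based localizer works with.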

Abstract

We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates is missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning-based localization methods.

Paper Structure (20 sections, 7 equations, 6 figures, 6 tables)

This paper contains 20 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Method overview: wav2pos can simultaneously localize a moving sound source and several microphones given audio recordings and microphone coordinates on a frame-by-frame basis. Here, predictions on the music3 recording from the LuViRA dataset [yaman2023luvira] are shown (viewed from above), where a moving median filter has been applied to predictions for better visualization.
  • Figure 2: High-level illustration of the proposed wav2pos method. Modality embedding (added before encoder and decoder) and pairwise positional encoding (added before decoder) are omitted for brevity. The mask tokens are not generated by the encoder, but appended as learnable tokens in the input sequence to the decoder.
  • Figure 3: Results on the simulated dataset under varying noise and reverberation conditions.
  • Figure 4: Microphone locations in the LuViRA dataset.
  • Figure 6: Cumulative error distributions on the LuViRA dataset using setup $1_a$ and different test splits.
  • ...and 1 more figure