wav2pos: Sound Source Localization using Masked Autoencoders
Axel Berg, Jens Gulin, Mark O'Connor, Chuteng Zhou, Karl Åström, Magnus Oskarsson
TL;DR
wav2pos addresses 3D sound source localization (SSL) with ad-hoc microphone arrays by formulating it as a multimodal set-to-set regression problem. It uses a transformer-based masked autoencoder to jointly process audio signals and microphone coordinates, so a single model can predict the positions of the sound source and of any missing microphones while handling variable array configurations. Key contributions include a pairwise positional encoding scheme, a time-delay feature module with NGCC-PHAT TDOA inputs, and a masking strategy that keeps performance robust under missing data. The method achieves competitive or superior accuracy on real LuViRA data and in simulated environments, with notable de-noising benefits. The approach offers a flexible, scalable framework for indoor SSL that can be extended to multiple sources and self-calibration scenarios, which could broaden practical deployment in dynamic, real-world settings.
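To make the time-delay features mentioned above more concrete, here is a minimal sketch of estimating a time difference of arrival (TDOA) between two microphone signals. It uses plain classical GCC-PHAT as a stand-in, not the NGCC-PHAT module named above. The function name `gcc_phat_delay`, its parameters, and the small regularization constant are assumptions made for illustration.

```python
import numpy as np

def gcc_phat_delay(x, y, fs, max_tau=None):
    """Estimate how much y lags x, in seconds, with classical GCC-PHAT.

    Hypothetical helper for illustration; not the paper's NGCC-PHAT module.
    """
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:                  # optionally restrict to plausible delays
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

# Usage: delay white noise by 5 samples and recover the delay
rng = np.random.default_rng(0)
fs = 16000
x = rng.standard_normal(1024)
y = np.concatenate((np.zeros(5), x[:-5]))
tau = gcc_phat_delay(x, y, fs)
print(f"estimated delay: {tau * fs:.0f} samples")
```

With a known speed of sound, each pairwise delay constrains the source to a hyperboloid relative to that microphone pair, which is why such delays are useful inputs for localization.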
Abstract
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of the audio recordings and microphone coordinates is missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning-based localization methods.
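The set-to-set formulation can be illustrated with a small sketch of how one training example might be assembled: the source position and some microphone positions are hidden in the input, and the full set of coordinates is the regression target. This is an illustrative assumption, not the authors' implementation. The names `mask_positions` and `masked_error`, the zero-valued placeholder for hidden coordinates, the choice to always hide the source, and scoring only the hidden entries are all hypothetical.

```python
import numpy as np

def mask_positions(mic_xyz, source_xyz, n_masked_mics, rng):
    """Build (inputs, targets, masked) for one example.

    mic_xyz: (M, 3) microphone coordinates; source_xyz: (3,) source coordinates.
    Hypothetical helper: a real model would likely use a learned mask token
    rather than zeros.
    """
    targets = np.vstack([mic_xyz, source_xyz[None, :]])    # (M + 1, 3) full set
    masked = np.zeros(len(targets), dtype=bool)
    masked[-1] = True                                       # assumption: source always hidden
    hidden = rng.choice(len(mic_xyz), size=n_masked_mics, replace=False)
    masked[hidden] = True                                   # hide a random subset of microphones
    inputs = np.where(masked[:, None], 0.0, targets)        # zero placeholder for hidden rows
    return inputs, targets, masked

def masked_error(pred, targets, masked):
    """Mean squared error over the hidden coordinates only (an assumed choice)."""
    return float(np.mean((pred[masked] - targets[masked]) ** 2))

# Usage: six microphones in a 5 m cube, two of them with unknown positions
rng = np.random.default_rng(0)
mics = rng.uniform(0.0, 5.0, size=(6, 3))
source = np.array([1.0, 2.0, 1.5])
inputs, targets, masked = mask_positions(mics, source, n_masked_mics=2, rng=rng)
print(masked.sum(), "of", len(masked), "positions are hidden")
```

Because the number of rows is not fixed, the same construction works for any number of microphones, which mirrors the flexibility claimed in the abstract.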
