PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Subash Khanal, Eric Xing, Srikumar Sastry, Aayush Dhakal, Zhexiao Xiong, Adeel Ahmad, Nathan Jacobs
TL;DR
The paper tackles global soundscape mapping by learning a probabilistic, multi-modal embedding space shared by audio, multi-scale satellite imagery, and text, with metadata conditioning and uncertainty estimation. It introduces the GeoSound dataset (over $300{,}000$ geotagged audio samples) and the Probabilistic Soundscape Mapping (PSM) framework, which combines a scale-aware satellite encoder using Ground-Sample Distance Positional Embedding (GSDPE) with a metadata fusion transformer and is optimized via PCME++ with pseudo-positives and a Variational Information Bottleneck. PSM achieves state-of-the-art cross-modal retrieval on GeoSound and SoundingEarth and benefits significantly from metadata and text conditioning, which it uses to produce uncertainty-aware, temporally dynamic soundscape maps. This work enables global, scalable, and temporally controllable soundscape mapping with explicit uncertainty estimates, aiding applications in environmental monitoring, urban planning, and location-based awareness.
Abstract
A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code are available at https://github.com/mvrl/PSM.
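
To make the idea of a probabilistic representation space concrete, the sketch below treats each embedding as a diagonal Gaussian (a mean vector plus a per-dimension variance) and computes the expected squared distance between two such embeddings in closed form. This is a minimal illustration of the general style of distance used in PCME++-like objectives, not the PSM implementation: the function names, the `scale` and `shift` parameters, and the toy vectors are all hypothetical, and the exact form used in the released code may differ.

```python
import math


def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """Expected squared distance E||Za - Zb||^2 for two independent
    diagonal Gaussians Za ~ N(mu_a, var_a) and Zb ~ N(mu_b, var_b).

    The result splits into a term for the means and a term for the
    variances, so more uncertain embeddings are pushed further apart.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def match_probability(distance, scale=1.0, shift=0.0):
    """Map a distance to a match probability with a sigmoid.

    `scale` and `shift` are hypothetical stand-ins for parameters
    that would be learned in a real model.
    """
    return 1.0 / (1.0 + math.exp(scale * distance - shift))


# Toy example: an audio embedding and a satellite-image embedding
# with identical means but different levels of uncertainty.
audio_mu, image_mu = [0.0, 0.0], [1.0, 0.0]
confident = expected_sq_distance(audio_mu, [0.1, 0.1], image_mu, [0.1, 0.1])
uncertain = expected_sq_distance(audio_mu, [1.0, 1.0], image_mu, [0.1, 0.1])
```

With the same means, the pair with larger variance gets a larger distance and therefore a lower match probability, which is how a representation of this kind can express uncertainty about the soundscape at a location.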
