PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping

Subash Khanal; Eric Xing; Srikumar Sastry; Aayush Dhakal; Zhexiao Xiong; Adeel Ahmad; Nathan Jacobs

TL;DR

The paper tackles global soundscape mapping by learning a probabilistic, multi-modal embedding across audio, multi-scale satellite imagery, and text, with metadata conditioning and uncertainty estimation. It introduces the GeoSound dataset (over $300{,}000$ geotagged audio samples) and a Probabilistic Soundscape Mapping (PSM) framework that employs a scale-aware satellite encoder with Ground-Sample Distance Positional Embedding (GSDPE) and a metadata fusion transformer, optimized via PCME++ with pseudo-positives and a Variational Information Bottleneck. PSM achieves state-of-the-art cross-modal retrieval on GeoSound and SoundingEarth and benefits significantly from metadata and text conditioning, producing uncertainty-aware, temporally dynamic soundscape maps. This work enables global, scalable, and temporally controllable soundscape mapping with explicit uncertainty estimates, aiding applications in environmental monitoring, urban planning, and location-based awareness.

Abstract

A soundscape is defined by the acoustic environment a person perceives at a location. In this work, we propose a framework for mapping soundscapes across the Earth. Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text. To capture the inherent uncertainty in the soundscape of a location, we design the representation space to be probabilistic. We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes. We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control. To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over $300k$ geotagged audio samples paired with both low- and high-resolution satellite imagery. We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset. Our dataset and code are available at https://github.com/mvrl/PSM.
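
To make the idea of a probabilistic representation space concrete, the following is a minimal sketch, not taken from the authors' code. It assumes each embedding is a diagonal Gaussian with a mean vector and a per-dimension standard deviation, and it scores a pair by the expected squared distance between independent samples, in the spirit of the closed-form distance used by PCME++. The function names and the `scale` and `shift` parameters are hypothetical; in a trained model the latter two would be learned.

```python
import math


def closed_form_sampled_distance(mu_a, sigma_a, mu_b, sigma_b):
    """Expected squared Euclidean distance between independent samples
    of two diagonal Gaussians N(mu_a, sigma_a^2) and N(mu_b, sigma_b^2).

    The result splits into a term for the distance between the means
    and a term that grows with the total variance (the uncertainty).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    variance_term = sum(sa ** 2 + sb ** 2 for sa, sb in zip(sigma_a, sigma_b))
    return mean_term + variance_term


def match_probability(distance, scale=1.0, shift=0.0):
    """Map a distance to a match probability: sigmoid(-scale * distance + shift).

    scale and shift are illustrative placeholders (assumption), standing in
    for parameters that would be learned during training.
    """
    return 1.0 / (1.0 + math.exp(scale * distance - shift))


# Hypothetical 2-D embeddings: one image and two audio clips with the same mean.
image_mu, image_sigma = [0.0, 1.0], [0.1, 0.1]
confident_audio = ([0.0, 1.0], [0.1, 0.1])   # small sigma: low uncertainty
uncertain_audio = ([0.0, 1.0], [1.0, 1.0])   # large sigma: high uncertainty

d_confident = closed_form_sampled_distance(image_mu, image_sigma, *confident_audio)
d_uncertain = closed_form_sampled_distance(image_mu, image_sigma, *uncertain_audio)
print(d_confident, match_probability(d_confident))
print(d_uncertain, match_probability(d_uncertain))
```

In this sketch, two embeddings with identical means still receive a lower match probability when either one is uncertain, which is the property that lets a probabilistic space expose uncertainty estimates alongside retrieval scores.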

Paper Structure (23 sections, 10 equations, 9 figures, 6 tables)

Introduction
Related Works
Audio Visual Learning
Deterministic Contrastive Learning
Probabilistic Contrastive Learning
Methods
Dataset Creation
Approach
Experimental Details
Results
Cross-Modal Retrieval with Bing
Cross-Modal Retrieval with Sentinel-2
Cross-Modal Retrieval on SoundingEarth
Effect of Metadata
Generating Country-Level Soundscape Maps
...and 8 more sections

Figures (9)

Figure 1: Our proposed framework, Probabilistic Soundscape Mapping (PSM), combines image, audio, and text encoders to learn a probabilistic joint representation space. Metadata, including geolocation (l), month (m), hour (h), audio-source (a), and caption-source (t), is encoded separately and fused with image embeddings using a transformer-based metadata fusion module. For each encoder, $\mu$ and $\sigma$ heads yield probabilistic embeddings, which are used to compute probabilistic contrastive loss.
Figure 2: Soundscape map of the USA for the textual query "Sound of insects", compared with a reference map (NIDRM) indicating the risk of pest-related hazards.
Figure 3: Two soundscape maps of the continental United States, generated using different query types, with a land cover map (landcover) for reference.
Figure 4: Temporally dynamic soundscape maps created by querying our model for different geographic areas.
Figure 5: Distribution of samples in the GeoSound dataset.
...and 4 more figures
