SoundLoc3D: Invisible 3D Sound Source Localization and Classification Using a Multimodal RGB-D Acoustic Camera

Yuhang He, Sangyun Shin, Anoop Cherian, Niki Trigoni, Andrew Markham

TL;DR

SoundLoc3D localizes and classifies 3D sound sources that are visually invisible but lie on object surfaces, using a multimodal RGB-D acoustic-camera rig. It frames the task as set prediction with learnable queries that are initialized from single-view mic-array signals and then refined using multiview RGB-D cues, depth-based surface proximity, and cross-view consistency, all integrated through a Transformer-based query mixer. The method shows clear performance gains over strong baselines on a large-scale synthetic dataset; depth information provides the largest improvement, and the method remains robust to ambient noise and depth inaccuracies. The approach offers an efficient route to 3D sound source localization for real-world scenarios such as monitoring machinery or detecting gas leaks in cluttered environments.

Abstract

Accurately localizing 3D sound sources and estimating their semantic labels, where the sources may not be visible but are assumed to lie on the physical surfaces of objects in the scene, has many real applications, including detecting gas leaks and machinery malfunctions. The weak audio-visual correlation in such settings poses new challenges in devising methods that answer whether and how cross-modal information can be used to solve the task. To this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array (Mic-Array). By using this rig to record audio-visual signals from multiple views, we can use the cross-modal cues to estimate the sound sources' 3D locations. Specifically, our framework SoundLoc3D treats the task as a set prediction problem in which each element of the set corresponds to a potential sound source. Given the weak audio-visual correlation, the set representation is initially learned from a single-view microphone array signal and then refined by actively incorporating physical surface cues revealed from multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on a large-scale simulated dataset, and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
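
The set prediction formulation can be illustrated with a small sketch. The Figure 2 caption states that queries are matched with ground truth through bipartite matching during training; the code below shows one common way to do such matching, using SciPy's `linear_sum_assignment`. The function name `match_queries_to_sources`, the cost terms (negative class probability plus Euclidean position error), and the weights `w_cls` and `w_pos` are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_queries_to_sources(pred_pos, pred_logits, gt_pos, gt_labels,
                             w_cls=1.0, w_pos=1.0):
    """Match N predicted queries to M ground-truth sources (M <= N).

    The cost form and the weights are illustrative assumptions:
    cost = w_cls * (-P(true class)) + w_pos * Euclidean position error.
    """
    # Numerically stable softmax over the class dimension.
    z = pred_logits - pred_logits.max(axis=1, keepdims=True)
    prob = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    cls_cost = -prob[:, gt_labels]                         # shape (N, M)
    pos_cost = np.linalg.norm(
        pred_pos[:, None, :] - gt_pos[None, :, :], axis=-1
    )                                                      # shape (N, M)
    cost = w_cls * cls_cost + w_pos * pos_cost

    # Minimum-cost one-to-one assignment; each source gets exactly one query.
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example: 4 queries, 2 ground-truth sources, 3 classes.
pred_pos = np.array([[0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [5.0, 5.0, 5.0],
                     [2.0, 0.0, 1.0]])
pred_logits = np.array([[2.0, 0.0, 0.0],
                        [0.0, 3.0, 0.0],
                        [0.0, 0.0, 3.0],
                        [1.0, 1.0, 1.0]])
gt_pos = np.array([[1.1, 1.0, 0.9],
                   [4.8, 5.1, 5.0]])
gt_labels = np.array([1, 2])

matches = match_queries_to_sources(pred_pos, pred_logits, gt_pos, gt_labels)
print(matches)  # list of (query index, source index) pairs
```

In this toy example, queries 1 and 2 are matched to the two sources, while queries 0 and 3 remain unmatched. In set prediction setups, unmatched queries are typically supervised as "no source", which lets a fixed number of queries cover a variable number of sources.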

Paper Structure

This paper contains 25 sections, 14 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: SoundLoc3D problem setup: Visually invisible sound sources lie freely on the surfaces of physical objects and emit sound. A: We use an acoustic camera to record Mic-Array signals and RGB-D images from multiple views. SoundLoc3D combines multiview cross-modal RGB images, depth maps, and Mic-Array signals to jointly estimate source position $p$ and semantic label $c$.
  • Figure 2: SoundLoc3D Pipeline. The RGB image is first pre-processed by a pre-trained, feature-matching-aware model (LoFTR) to obtain an embedding. The Mic-Array signal feature is extracted by stacking log-Mel-scale TF and GCC-PHAT features. The query generator $\boldsymbol{\mathcal{G}}$ produces the initial queries, which are then fed to the query decoder $\boldsymbol{\mathcal{D}}$ to aggregate sound source cues informed by cross-view RGB images. The aggregated queries are further optimized by the Feature Mixer network $\boldsymbol{\mathcal{M}}$. During training, these queries are matched with the ground truth through bipartite matching, and the loss considers the discrepancy between prediction and ground truth, depth-map-informed closeness, and multiview detection consistency. During inference, the optimized queries are simply decoded into sound sources.
  • Figure 3: Sound Source Cue from Multiview RGB-D images and Crossview Consistency: A. Only the projections of an "on the surface" sound source onto multiview RGB images are guaranteed to be visually similar; the projections of sources above or below the surface are much less likely to be. B. The closer a predicted sound source is to the object surface, the smaller its distance to the source position (centroid) informed by the multiview depth maps. C. The same sound source predicted from each single view should lie close together across views.
  • Figure 4: Localization Result Visualization: We visualize the sound source localization results of different methods in the 3D visual space, together with the ground truth positions. Zoom in for better visualization. We provide data and visualization code in the supplementary material.
  • Figure 5: Ambient noise test: we add white Gaussian ambient noise measured by SNR in dB.
  • ...and 3 more figures
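
Two of the signal-level ideas in the captions above, GCC-PHAT features (Figure 2) and white Gaussian noise at a given SNR in dB (Figure 5), can be sketched with NumPy. This is a minimal two-channel illustration under stated assumptions, not the paper's feature extractor: the function names `add_white_noise` and `gcc_phat`, the 10 dB SNR, the 7-sample delay, and the use of a synthetic white-noise source are all hypothetical choices.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def gcc_phat(x, y, max_lag):
    """PHAT-weighted cross-correlation of x against y.

    Returns (lags, cc). A peak at a positive lag means x is a delayed copy of y.
    """
    n = len(x) + len(y)                   # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12        # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


# Synthetic two-microphone example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
true_delay = 7                            # samples by which mic_b lags mic_a
mic_a = source
mic_b = np.concatenate((np.zeros(true_delay), source[:-true_delay]))

# Corrupt both channels with independent white noise at 10 dB SNR.
noisy_a = add_white_noise(mic_a, snr_db=10.0, rng=rng)
noisy_b = add_white_noise(mic_b, snr_db=10.0, rng=rng)

lags, cc = gcc_phat(noisy_b, noisy_a, max_lag=32)
estimated_delay = int(lags[np.argmax(cc)])

# Empirical SNR of the noisy channel, as a sanity check on add_white_noise.
measured_snr_db = 10.0 * np.log10(
    np.mean(mic_a ** 2) / np.mean((noisy_a - mic_a) ** 2)
)
print(estimated_delay, round(float(measured_snr_db), 2))
```

With a four-channel array, the same correlation would be computed for each microphone pair, and the resulting curves could be stacked alongside log-Mel features as the Figure 2 caption describes; how exactly the paper arranges these features is not specified here.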