Disentangled Acoustic Fields For Multimodal Physical Scene Understanding

Jie Yin; Andrew Luo; Yilun Du; Anoop Cherian; Tim K. Marks; Jonathan Le Roux; Chuang Gan

TL;DR

This work introduces Disentangled Acoustic Fields (DAFs), a model of sound formation that generalizes across scenes, for multimodal physical scene understanding. An encoder maps the impact sound into an explicit, disentangled latent space of sound-generation factors (object location, type, and material), and a generator reconstructs the power spectral density (PSD) of the sound from those factors, forming an analysis-by-synthesis loop. The reconstruction loss over candidate locations yields a spatial uncertainty map that guides exploration. Training with KL regularization and PSD losses supports amortized inference, and the planner integrates the audio-derived map with RGB-D visual cues. Experiments on the TDW-based Find Fallen Object task and related datasets show significant gains over baselines in object-property inference and navigation efficiency, including cross-scene generalization.

Abstract

We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress the variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we illustrate that learning a disentangled model of acoustic formation, referred to as disentangled acoustic field (DAF), to capture the sound generation and propagation process, enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate for the localization of fallen objects by proposing multiple plausible exploration locations.
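The idea of turning an analysis-by-synthesis model into a spatial uncertainty map can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_generator` is a hand-written stand-in for a learned generator, and the grid, the toy "physics", and the mean-squared-error loss are not taken from the paper. The sketch only shows the mechanism: synthesize a spectrum for each candidate source position, compare it with the observed spectrum, and treat low reconstruction loss as high plausibility.

```python
import numpy as np

def toy_generator(position, n_freq=64):
    """Hypothetical stand-in for a learned generator: maps a 2D source
    position (relative to the listener) to a log-PSD vector."""
    x, y = position
    distance = np.hypot(x, y)
    freqs = np.linspace(0.0, 1.0, n_freq)
    # Toy assumption: overall level drops with distance, and high
    # frequencies are attenuated more strongly for distant sources.
    level = -np.log1p(distance)
    tilt = -0.5 * distance * freqs
    # Toy direction cue: a spectral ripple whose phase depends on the angle.
    angle = np.arctan2(y, x)
    ripple = 0.3 * np.cos(2.0 * np.pi * freqs + angle)
    return level + tilt + ripple

def loss_map(observed_psd, xs, ys):
    """Reconstruction loss for every candidate position on a grid."""
    losses = np.empty((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            predicted = toy_generator((x, y), n_freq=observed_psd.shape[0])
            losses[i, j] = np.mean((predicted - observed_psd) ** 2)
    return losses

# Candidate positions on a 21 x 21 grid with 0.5 spacing.
xs = np.linspace(-5.0, 5.0, 21)
ys = np.linspace(-5.0, 5.0, 21)

# Simulate an observation from a known position, then recover it.
true_position = (2.0, -1.5)
observed = toy_generator(true_position)
losses = loss_map(observed, xs, ys)

# The lowest-loss cell is the most plausible source location.
i, j = np.unravel_index(np.argmin(losses), losses.shape)
best = (xs[j], ys[i])
```

In this toy setting the observation is noise-free, so the minimum of the map sits exactly on the true position. With real audio the map would typically have several low-loss regions, which is what makes it useful for proposing multiple exploration targets.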

Paper Structure (16 sections, 9 equations, 8 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Neural Implicit Representations
Multimodal Scene Understanding
Audio-Visual Navigation
Proposed Method
Physics of Sound
Disentangled Acoustic Fields (DAFs)
Inference of Sound Parameters
Experiment
Inference of Object Properties
Navigation and Planning
Conclusion
Appendix
Navigation Failure Case Analysis
...and 1 more section

Figures (8)

Figure 1: Illustration of DAFs. The encoder maps the binaural short-time Fourier transform (STFT) of the audio input into a new space containing physical audio information such as object position, material, type, and a continuous latent. The decoder utilizes these parameters to reconstruct the power spectral density (PSD) of the audio. The two components form an analysis-by-synthesis loop capable of inferring object properties, and are jointly learned during training.
Figure 2: Planning with DAFs. The agent jointly uses auditory and visual information as part of the planning process. The auditory branch takes as input the sound $S$ represented as STFT. Using the DAF, we infer the factors responsible for the sound production including possible object types and a reconstruction loss map for each potential object location. The visual branch takes as input RGB-D images and provides a semantic map and occupancy map to the planner. The planner combines the information and uses the loss map to produce a priority list of locations. Path planning is completed using the $A^*$ algorithm.
Figure 3: Visualization of visual input and the sound-derived loss map in four scenes. Top: RGB images of the agent's view with the target object in a red bounding box. Middle: Semantic map produced from the RGB images. Bottom: The red line indicates the path the agent takes, with the end point shown as a circular dot. The ground-truth object location is shown as a gold star.
Figure 4: Comparison of agent trajectories. We compare the trajectories produced by our method (red) against those of the modular planning baseline (green) across several cases. In the loss map, dark blue indicates regions of low position loss and yellow indicates regions of high position loss. The gold star marks the ground-truth position of the fallen object. The end of each trajectory is circled in white for clarity. In (a)$-$(f), the baseline method fails to find the target, while our method succeeds. In (g)$-$(h), both methods find the target, but our method takes a shorter path.
Figure 5: Failure from visual branch. The visual branch is learned independently of the auditory branch. Semantic segmentation errors can occur when objects are visually small or of low contrast. Future work can explore the contrastive learning of joint audio-visual representations.
...and 3 more figures
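The planning step described in the Figure 2 caption (a loss map turned into a priority list of locations, followed by $A^*$ path planning) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' planner: the function names `priority_locations` and `astar`, the 4-connected grid, the unit step cost, and the Manhattan heuristic are all choices made here for the example.

```python
import heapq
import numpy as np

def priority_locations(loss, occupancy, k=3):
    """Return the k free cells with the lowest loss, best first (hypothetical helper)."""
    masked = np.where(occupancy, np.inf, loss)  # never target an obstacle
    order = np.argsort(masked, axis=None)[:k]
    return [tuple(int(v) for v in np.unravel_index(idx, loss.shape)) for idx in order]

def astar(occupancy, start, goal):
    """A* on a 4-connected grid with unit step cost and a Manhattan heuristic."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale queue entry, a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < occupancy.shape[0] and 0 <= nxt[1] < occupancy.shape[1]
            if not inside or occupancy[nxt]:
                continue
            new_cost = g + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None  # goal unreachable

# Toy 5 x 5 map with a wall in column 2 that leaves a gap in the last row.
occupancy = np.zeros((5, 5), dtype=bool)
occupancy[0:4, 2] = True

# Toy loss map: two plausible free cells, plus a low value inside the wall
# that the planner must ignore.
loss = np.ones((5, 5))
loss[0, 4] = 0.1
loss[4, 0] = 0.2
loss[1, 2] = 0.0

targets = priority_locations(loss, occupancy, k=2)
path = astar(occupancy, (0, 0), targets[0])
```

An agent following this scheme would visit `targets` in order, replanning with `astar` toward the next entry whenever the current candidate turns out not to contain the object.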
