Disentangled Acoustic Fields For Multimodal Physical Scene Understanding
Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan
TL;DR
This work introduces Disentangled Acoustic Fields (DAFs), a model of sound formation that generalizes across scenes, for multimodal physical scene understanding. The approach factors the variables that govern sound generation into an explicit, disentangled latent space and uses an analysis-by-synthesis loop that reconstructs the power spectral density (PSD) of the sound. This enables inference of object location, type, and material across environments, and yields spatial uncertainty maps that guide exploration. The encoder–generator framework is trained with KL regularization and PSD losses; it supports amortized inference and uncertainty-aware planning that combines audio with RGB-D visual cues. Experiments on the TDW-based Find Fallen Object benchmark and related datasets report gains over baselines in object-property inference and navigation efficiency, including cross-scene generalization. Overall, the method advances sound-based localization in cluttered, multi-environment settings by combining a physical model of sound formation with multimodal planning.
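To make the "PSD losses" and "KL regularization" mentioned above concrete, here is a minimal NumPy sketch of one plausible training objective. Everything in it is an assumption for illustration: the function names (`power_spectral_density`, `kl_to_standard_normal`, `daf_style_loss`), the Welch-style PSD estimate, the log-spectral squared error, and the weight `beta` are hypothetical and are not taken from the paper.

```python
import numpy as np

def power_spectral_density(signal, n_fft=256, hop=128):
    """Welch-style PSD estimate: mean squared magnitude of Hann-windowed FFT frames.

    Assumes len(signal) >= n_fft. The frame sizes are illustrative defaults.
    """
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return spectra.mean(axis=0)  # shape: (n_fft // 2 + 1,)

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def daf_style_loss(psd_pred, psd_true, mu, log_var, beta=1e-3, eps=1e-8):
    """Hypothetical objective: log-PSD reconstruction error plus a weighted KL term."""
    recon = np.mean((np.log(psd_pred + eps) - np.log(psd_true + eps)) ** 2)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Usage: a perfect reconstruction with a standard-normal posterior has zero loss.
rng = np.random.default_rng(0)
psd = power_spectral_density(rng.standard_normal(1024))
loss = daf_style_loss(psd, psd, mu=np.zeros(4), log_var=np.zeros(4))
```

In this sketch the reconstruction term compares spectra rather than waveforms, which makes the objective insensitive to phase; the KL term pulls the encoder's latent posterior toward a standard normal prior.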
Abstract
We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring the object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress these variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we show that learning a disentangled model of acoustic formation, referred to as a disentangled acoustic field (DAF), that captures the sound generation and propagation process enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate of localizing fallen objects by proposing multiple plausible exploration locations.
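As an illustration of how a spatial uncertainty map could yield "multiple plausible exploration locations", the following is a minimal sketch that greedily picks high-probability grid cells while suppressing their neighbors. The grid representation, the function name `propose_locations`, and the parameters `k` and `min_sep` are assumptions made for this example; the paper's actual proposal procedure may differ.

```python
import numpy as np

def propose_locations(prob_map, k=3, min_sep=2):
    """Greedily pick up to k cells with the highest probability.

    After each pick, a square neighborhood of radius min_sep around it is
    zeroed out, so later picks lie more than min_sep cells away (Chebyshev
    distance). Stops early when no positive-probability cell remains.
    """
    remaining = np.asarray(prob_map, dtype=float).copy()
    picks = []
    for _ in range(k):
        r, c = np.unravel_index(np.argmax(remaining), remaining.shape)
        if remaining[r, c] <= 0:
            break
        picks.append((int(r), int(c)))
        r0, r1 = max(0, r - min_sep), r + min_sep + 1
        c0, c1 = max(0, c - min_sep), c + min_sep + 1
        remaining[r0:r1, c0:c1] = 0.0  # suppress the neighborhood of this pick
    return picks

# Usage: two nearby peaks collapse into one proposal; the far peak is kept.
grid = np.zeros((10, 10))
grid[2, 2], grid[2, 3], grid[7, 7] = 0.5, 0.4, 0.3
candidates = propose_locations(grid, k=3, min_sep=2)
```

The neighborhood suppression keeps the proposals spread out, so an agent visiting them in order would explore distinct regions rather than adjacent cells around a single peak.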
