AudioScene: Integrating Object-Event Audio into 3D Scenes
Shuaihang Yuan, Congcong Wen, Muhammad Shafique, Anthony Tzes, Yi Fang
TL;DR
This work tackles the paucity of spatially grounded audio datasets by introducing Audio-ScanNet and Audio-RoboTHOR, which fuse audio clips with 3D scenes. A novel data-creation pipeline combines GPT-4–generated object–event mappings with manual verification and cross-dataset alignment to produce richly annotated, spatial audio data. The authors define two benchmarks, audio-based 3D visual grounding and audio-based robotic navigation, and demonstrate a strong performance boost when integrating audio encoders and LLM-based event linking, revealing both the potential and the current limitations of audio-centric methods in realistic environments. The datasets and methods offer practical value for multimodal learning in AR, smart homes, and robotics, while guiding future improvements in spatial audio understanding.
Abstract
The rapid advances in audio analysis underscore its vast potential for human-computer interaction, environmental monitoring, and public safety; yet existing audio-only datasets often lack spatial context. To address this gap, we present two novel audio-spatial scene datasets, Audio-ScanNet and Audio-RoboTHOR, designed to explore audio-conditioned tasks within 3D environments. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context. To associate audio events with corresponding spatial information, we leverage the common-sense reasoning ability of large language models and supplement it with rigorous human verification. This approach offers greater scalability than purely manual annotation while maintaining high standards of accuracy, completeness, and diversity, quantified through inter-annotator agreement and performance on two benchmark tasks: audio-based 3D visual grounding and audio-based robotic zero-shot navigation. The results highlight the limitations of current audio-centric methods and underscore the practical challenges and significance of our datasets in advancing audio-guided spatial learning.
