AudioScene: Integrating Object-Event Audio into 3D Scenes

Shuaihang Yuan, Congcong Wen, Muhammad Shafique, Anthony Tzes, Yi Fang

TL;DR

This work tackles the paucity of spatially grounded audio datasets by introducing Audio-ScanNet and Audio-RoboTHOR, which fuse audio clips with 3D scenes. A novel data-creation pipeline combines GPT-4-generated object-event mappings with manual verification and cross-dataset alignment to produce richly annotated spatial audio data. The authors define two benchmarks, audio-based 3D visual grounding and audio-based robotic navigation, and demonstrate a strong performance boost when integrating audio encoders and LLM-based event linking, revealing both the potential and the current limitations of audio-centric methods in realistic environments. The datasets and methods offer practical value for multimodal learning in AR, smart homes, and robotics, while guiding future improvements in spatial audio understanding.

Abstract

The rapid advances in audio analysis underscore its vast potential for human-computer interaction, environmental monitoring, and public safety; yet existing audio-only datasets often lack spatial context. To address this gap, we present two novel audio-spatial scene datasets, AudioScanNet and AudioRoboTHOR, designed to explore audio-conditioned tasks within 3D environments. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context. To associate audio events with corresponding spatial information, we leverage the commonsense reasoning ability of large language models and supplement it with rigorous human verification. This approach offers greater scalability than purely manual annotation while maintaining high standards of accuracy, completeness, and diversity, quantified through inter-annotator agreement and performance on two benchmark tasks: audio-based 3D visual grounding and audio-based robotic zero-shot navigation. The results highlight the limitations of current audio-centric methods and underscore the practical challenges and significance of our datasets in advancing audio-guided spatial learning.

Paper Structure

This paper contains 27 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of the Audio-Scene dataset creation process. Categories are selected from a scene dataset to identify common objects, followed by a GPT-4-facilitated event generation phase that produces corresponding audio events. Concurrently, audio event datasets are normalized. These events are then mapped to scene objects using GPT-4, forming object-event-audio pairs. The pairs are mapped back into the original 3D scene, finalizing the dataset.
  • Figure 2: Distribution of audio clips across different events in our dataset.
  • Figure 3: Data distribution of audio clips by object categories.
  • Figure 4: Visualization of Audio-based 3D Object Grounding Methods on the ScanNet Dataset. The figure illustrates the localization results for different audio inputs: 'snoring,' 'wind,' and 'toilet flush,' using various methods against the ground truth.
  • Figure 5: Robot Navigation Path Triggered by Audio Cue. The right side shows the robot's trajectory in response to the sound of a plant, and the left side displays the robot's view of the target object upon arrival.