
SonifyAR: Context-Aware Sound Generation in Augmented Reality

Xia Su, Jon E. Froehlich, Eunyee Koh, Chang Xiao

TL;DR

SonifyAR addresses the lack of context-aware sound in AR with a Programming by Demonstration (PbD) workflow that automatically captures event context and uses an LLM to orchestrate four sound acquisition methods: recommendation, retrieval, generation, and transfer. By textualizing context from user actions, virtual objects, and real-world surfaces, the system acquires sound assets matched to materials and animations, enabling in-situ sound generation for complex AR interactions. A modular implementation built on ARKit, Dense Material Segmentation, AudioLDM, and GPT-4 is evaluated in a study with eight designers and demonstrated in five application scenarios, including education and accessibility; the evaluation also highlights room for improvement in sound quality and UI design. The work points to a practical path toward more immersive, accessible, and efficient AR sound authoring, with implications for AR content creation and headset safety.

Abstract

Sound plays a crucial role in enhancing user experience and immersiveness in Augmented Reality (AR). However, current platforms lack support for AR sound authoring due to limited interaction types, challenges in collecting and specifying context information, and difficulty in acquiring matching sound assets. We present SonifyAR, an LLM-based AR sound authoring system that generates context-aware sound effects for AR experiences. SonifyAR expands the current design space of AR sound and implements a Programming by Demonstration (PbD) pipeline to automatically collect contextual information about AR events, including virtual content semantics and real-world context. This context information is then processed by a large language model to acquire sound effects with Recommendation, Retrieval, Generation, and Transfer methods. To evaluate the usability and performance of our system, we conducted a user study with eight participants and created five example applications, including an AR-based science experiment, a case for improving AR headset safety, and an assistive example for low-vision AR users.

Paper Structure (33 sections, 11 figures)

Figures (11)

  • Figure 1: The sound-producing opportunities in the triad of User, Virtuality and Reality.
  • Figure 2: Overview of the pipeline of SonifyAR. Our system monitors and logs context information of AR events, which includes the event type, the subjects and objects (virtual or real-world), and the attributes of the involved elements like their materials. This information is compiled into a text template and then processed by our LLM controller to acquire sound assets. The results are subsequently presented in our selection panel.
  • Figure 3: SonifyAR's event textualization. Left: a user tapping on a virtual cup through the SonifyAR interface; Right: the textual context information extracted by SonifyAR's internal PbD framework.
  • Figure 4: SonifyAR's sound acquisition pipeline.
  • Figure 5: SonifyAR's authoring interface. Left: SonifyAR's phone-based AR interface. Right: SonifyAR's authoring panel.
  • ...and 6 more figures
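
The captions for Figures 2 and 3 describe compiling an AR event's context (event type, subject, object, and attributes such as materials) into a text template before it reaches the LLM controller. The paper's actual template is not reproduced here, so the following is only a minimal sketch of what such textualization might look like. The `AREvent` class, its field names, the `textualize` function, the sentence format, and the example material value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AREvent:
    """Hypothetical record of one logged AR event (all field names are assumptions)."""
    event_type: str                 # e.g. "tap", "collide", "slide"
    subject: str                    # who or what acts, e.g. "user" or a virtual object
    target: Optional[str] = None    # what is acted upon, if anything
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. material labels


def textualize(event: AREvent) -> str:
    """Compile an event into a short text description usable in an LLM prompt."""
    sentence = f"Event: {event.subject} performs '{event.event_type}'"
    if event.target:
        sentence += f" on {event.target}"
    sentence += "."
    if event.attributes:
        # Sort keys so the same event always yields the same text.
        details = "; ".join(f"{k}: {v}" for k, v in sorted(event.attributes.items()))
        sentence += f" Attributes: {details}."
    return sentence


if __name__ == "__main__":
    # Illustrative values only; "ceramic" is an assumed material label.
    tap = AREvent("tap", "user", "virtual cup", {"cup material": "ceramic"})
    print(textualize(tap))
```

A flat, deterministic sentence like this is easy to log and to embed in a prompt; a real system would likely carry richer context, such as the detected real-world surface material or animation details.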