Auptimize: Optimal Placement of Spatial Audio Cues for Extended Reality

Hyunsung Cho; Alexander Wang; Divya Kartik; Emily Liying Xie; Yukang Yan; David Lindlbauer

TL;DR

Auptimize tackles localization errors in XR spatial audio by disentangling audio cue locations from their visual hosts and relocating cues to optimal azimuth positions. It combines a data-driven analyzer (estimating localization blur $P(V|S)$ and cone-of-confusion distance $D(V,S)$) with an integer-programming optimizer that assigns audio cues to 12°-bin locations $S^*$ to maximize a weighted objective. Empirical evaluation shows Auptimize outperforms generic head-related transfer function (HRTF) and dynamic audio baselines in source identification accuracy and reduces response times, demonstrating practical benefits for multi-source audio guidance in XR. The approach establishes a foundational method for reliable audio-visual integration in spatial interfaces and points to future work on personalization, complex scenes, and additional audio-visual cues.
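To make the placement idea concrete, here is a minimal sketch of assigning cues to 12° bins under a weighted objective. It is not the paper's formulation: the toy score below (reward separation between cues, including separation from each cue's front-back mirror, and penalize displacement from the visual host) merely stands in for the data-driven terms $P(V|S)$ and $D(V,S)$, and an exhaustive search replaces the integer-programming solver. The function names, the weights `w_sep` and `w_shift`, the `max_shift` limit, and the convention that bin `i` is centered at `12*i` degrees with 0° straight ahead are all assumptions made for illustration.

```python
from itertools import product

BIN_SIZE = 12               # degrees per azimuth bin, as in the text
N_BINS = 360 // BIN_SIZE    # 30 bins around the listener


def circular_bin_distance(a, b):
    """Shortest distance, in bins, between two bins on the 30-bin circle."""
    d = abs(a - b) % N_BINS
    return min(d, N_BINS - d)


def front_back_mirror(b):
    """Bin reached by mirroring across the interaural (left-right) axis.

    Assumes bin i is centered at 12*i degrees with 0 degrees straight ahead,
    so an azimuth theta mirrors to 180 - theta.
    """
    return (N_BINS // 2 - b) % N_BINS


def confusability_gap(a, b):
    """Toy separation: distance from a to b or to b's front-back mirror."""
    return min(circular_bin_distance(a, b),
               circular_bin_distance(a, front_back_mirror(b)))


def score(sound_bins, visual_bins, w_sep=1.0, w_shift=0.5):
    """Hypothetical weighted objective: reward separation, penalize displacement."""
    n = len(sound_bins)
    sep = sum(confusability_gap(sound_bins[i], sound_bins[j])
              for i in range(n) for j in range(i + 1, n))
    shift = sum(circular_bin_distance(s, v)
                for s, v in zip(sound_bins, visual_bins))
    return w_sep * sep - w_shift * shift


def place_cues(visual_bins, max_shift=2):
    """Exhaustively search sound bins within max_shift bins of each visual host."""
    candidates = [[(v + k) % N_BINS for k in range(-max_shift, max_shift + 1)]
                  for v in visual_bins]
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        if len(set(combo)) < len(combo):   # at most one cue per bin
            continue
        s = score(combo, visual_bins)
        if s > best_score:
            best, best_score = list(combo), s
    return best, best_score
```

For example, with visual hosts at bins 0, 1, and 15 (0°, 12°, and 180°), the first and third hosts are front-back mirrors of each other, so their gap under this toy metric is 0 when the sounds stay at the visual positions. `place_cues([0, 1, 15])` searches nearby bins for an assignment that scores at least as well as that unshifted baseline. Exhaustive search only scales to a handful of cues; a real solver would be needed beyond that.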

Abstract

Spatial audio in Extended Reality (XR) provides users with better awareness of where virtual elements are placed, and efficiently guides them to events such as notifications, system alerts from different windows, or approaching avatars. Humans, however, localize sound cues inaccurately, especially when multiple sources are present, because of limitations in auditory perception such as angular discrimination error and front-back confusion. This decreases the efficiency of XR interfaces because users misidentify which XR element a sound is coming from. To address this, we propose Auptimize, a novel computational approach for placing XR sound sources, which mitigates such localization errors by utilizing the ventriloquist effect. Auptimize disentangles the sound source locations from the visual elements and relocates the sound sources to optimal positions for unambiguous identification of sound cues, avoiding errors due to inter-source proximity and front-back confusion. Our evaluation shows that Auptimize decreases spatial audio-based source identification errors compared to playing sound cues at the paired visual-sound locations. We demonstrate the applicability of Auptimize for diverse spatial audio-based interactive XR scenarios.

Paper Structure (47 sections, 7 equations, 12 figures, 2 tables)

Introduction
Background and Related Work
How Humans Localize Sound
Ventriloquist Effect
Localization in Digital Systems
Spatial Audio for Interactive Systems
Spatial Audio in XR
Audio Cues
Data Collection
Study Design and Procedure
Results
Localization Blur
Cone of Confusion
Optimal Sound (Dis)placement
Aggregated Probability vs. Individual Users
...and 32 more sections

Figures (12)

Figure 1: Spherical and rectangular coordinate systems used in this work.
Figure 2: Interaural time differences (ITD) per azimuth angle.
Figure 3: Cone of confusion, where the points at the cross-sections of the cone have the same interaural time differences.
Figure 4: Data collection environment for spatial audio localization. Participants were seated and asked to point at the location of sound sources on a semi-transparent sphere with a checkerboard pattern.
Figure 5: Confusion matrix of azimuth localization in the data collection study, discretized into 30 bins of size 12°. Each cell represents the probability that a sound played at a true azimuth angle $\theta$ is perceived as coming from the predicted azimuth angle $\hat{\theta}$. Cells in purple ellipses indicate the effects of localization blur, and cells in orange ellipses indicate the effects of localization blur around the cone of confusion.
...and 7 more figures
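
The binned confusion matrix described in the Figure 5 caption can be sketched as follows. This is only an illustration of the discretization into 30 bins of 12° and the row normalization into probabilities. The function names, the choice to center bin `i` at `12*i` degrees, and the example trials in the usage note are assumptions and made-up values, not the study's data or code.

```python
BIN_SIZE = 12               # degrees per azimuth bin
N_BINS = 360 // BIN_SIZE    # 30 bins


def to_bin(azimuth_deg):
    """Map an azimuth in degrees to the nearest bin center (assumed at 12*i degrees)."""
    return int(((azimuth_deg % 360) + BIN_SIZE / 2) // BIN_SIZE) % N_BINS


def confusion_matrix(trials):
    """Build a row-normalized matrix from (true_deg, perceived_deg) pairs.

    Row = true bin, column = perceived bin. Each non-empty row sums to 1;
    rows for bins with no trials are left as all zeros.
    """
    counts = [[0] * N_BINS for _ in range(N_BINS)]
    for true_deg, perceived_deg in trials:
        counts[to_bin(true_deg)][to_bin(perceived_deg)] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

As a usage example with invented trials, `confusion_matrix([(0, 2), (0, 178), (0, 1), (0, -3), (90, 95)])` puts three of the four frontal trials on the diagonal of row 0 and one in bin 15 (180°). The diagonal mass corresponds to the localization-blur pattern and the bin-15 entry to the front-back pattern that the caption describes.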
