GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality

Jaewook Lee; Jun Wang; Elizabeth Brown; Liam Chu; Sebastian S. Rodriguez; Jon E. Froehlich

TL;DR

GazePointAR addresses pronoun disambiguation for voice assistants (VAs) in wearable AR by fusing real-time eye gaze, pointing gestures, and conversation history with large language model (LLM) reasoning. The system rewrites each user query, replacing pronouns with explicit referents, and sends the result to an LLM (GPT-3/GPT-3.5-turbo), which generates a natural, context-aware answer that is read aloud via text-to-speech (TTS). Two studies, an in-lab three-part evaluation and a five-day in-the-wild deployment, show gains in naturalness and immediacy over traditional VAs while revealing limitations in gaze data handling, explainability, and multi-referent queries. The work advances context-aware VA design for AR wearables and offers design implications and future directions for incorporating continuous gaze tracking, user autonomy, and richer multimodal explanations in real-world settings.
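
The query-rewriting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the pronoun list, the function names `resolve_pronouns` and `build_prompt`, and the prompt layout are all assumptions made for illustration, and the referent labels are passed in directly where the real system would derive them from gaze and pointing.

```python
import re

# Hypothetical pronoun list for illustration; the paper's taxonomy may differ.
PRONOUNS = ("this", "that", "these", "those", "it")
_PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def resolve_pronouns(query: str, referents: list[str]) -> str:
    """Replace each pronoun, in order of appearance, with the next referent label.

    `referents` stands in for labels that gaze and pointing would supply
    (an assumption). Pronouns left over once the labels run out are kept.
    """
    remaining = iter(referents)

    def substitute(match: re.Match) -> str:
        label = next(remaining, None)
        return match.group(0) if label is None else label

    return _PATTERN.sub(substitute, query)


def build_prompt(query: str, referents: list[str], history: list[str]) -> str:
    """Assemble a plain-text prompt from prior turns and the rewritten query."""
    lines = [f"Previous: {turn}" for turn in history]
    lines.append(f"Question: {resolve_pronouns(query, referents)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "How much does this cost?",
        ["a red bell pepper"],
        ["What store am I in?"],
    ))
```

Naive in-order substitution like this handles a query such as "Is that healthier than this?" only if the referent labels arrive in the same order as the pronouns, which hints at why multi-referent queries are listed among the system's limitations.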

Abstract

Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; and (3) an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although pronoun use was sometimes counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in the wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.

Paper Structure (28 sections, 9 figures, 4 tables)

Introduction
Related Work
Pronoun Usage in Speech
Multimodal Interaction
Multimodal Interaction with Voice Assistants
Multimodal Interaction in Augmented Reality
Other Uses of Gaze, Pointing, and Speech in Wearable AR
GazePointAR Prototype 1
Taxonomy of Pronoun Use and Resolution
System Implementation
Study 1: Three-Part Lab Evaluation of GazePointAR
Participants
Procedure
Data and Analysis
Findings
...and 13 more sections

Figures (9)

Figure 1: System overview and implementation details of GazePointAR
Figure 2: Cooking scenario and the three VAs used in Part 1 of the study.
Figure 3: Usage scenarios in Part 2 of the study.
Figure 4: Design probes in Part 3 of the study. See supplementary materials for the videos.
Figure 5: The mean and standard deviation of task time, usability, perceived intelligence, helpfulness, naturalness, and overall preference. Task time is in seconds. Usability is scored 0-100; higher is better. Rankings are 1-3; lower is better. For statistical significance, one asterisk (*) denotes $p < 0.05$ and two asterisks (**) denote $p < 0.01$.
...and 4 more figures
