GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality
Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S. Rodriguez, Jon E. Froehlich
TL;DR
GazePointAR addresses pronoun disambiguation for voice assistants in wearable AR by fusing real-time eye gaze, pointing gestures, and conversation history with large-language-model reasoning. The system rewrites each user query so that pronouns are replaced with explicit referents, then sends the rewritten query to an LLM (GPT-3/GPT-3.5-turbo) to generate a natural, context-aware answer that is read aloud via TTS. Two studies, an in-lab three-part evaluation and a five-day in-the-wild deployment, demonstrate gains in naturalness and immediacy over traditional VAs, while revealing limitations in gaze data handling, explainability, and multi-referent queries. The work advances context-aware VA design for AR wearables and offers design implications and future directions for incorporating continuous gaze tracking, user autonomy, and richer multimodal explanations in real-world settings.
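The pronoun-replacement step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word list, the function name `resolve_query`, and the assumption that gaze and pointing have already been reduced to a single text description of the referent are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical list of deictic words to look for; the actual list used by
# GazePointAR is not stated in this summary.
DEICTIC_WORDS = ("this", "that", "these", "those", "it", "here", "there")
_DEICTIC_PATTERN = re.compile(
    r"\b(" + "|".join(DEICTIC_WORDS) + r")\b", re.IGNORECASE
)


def resolve_query(query: str, referent: Optional[str]) -> str:
    """Replace the first deictic word in `query` with `referent`.

    `referent` is assumed to be a short text description of whatever the
    user is looking at or pointing to (for example, the output of an
    object recognizer). If no referent is available, the query is
    returned unchanged.
    """
    if not referent:
        return query
    # A callable replacement avoids treating backslashes in `referent`
    # as regex template escapes.
    return _DEICTIC_PATTERN.sub(lambda match: referent, query, count=1)


if __name__ == "__main__":
    print(resolve_query("How much is this?", "the bottle of olive oil"))
```

In a full pipeline, the rewritten query would then be passed to the LLM along with any conversation history. Replacing only the first match is a simplification, and it does not handle queries with several referents, which the studies identify as a limitation.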
Abstract
Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; and (3) an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although pronoun use was sometimes counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in the wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.
