GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality

Jaewook Lee, Jun Wang, Elizabeth Brown, Liam Chu, Sebastian S. Rodriguez, Jon E. Froehlich

TL;DR

GazePointAR addresses pronoun disambiguation for voice assistants (VAs) in wearable AR by fusing real-time eye gaze, pointing gestures, and conversation history with large language model (LLM) reasoning. The system rewrites user queries to replace pronouns with explicit referents, then queries an LLM (GPT-3/GPT-3.5-turbo) to generate natural, context-aware answers that are read aloud via text-to-speech (TTS). Two studies, an in-lab three-part evaluation and a five-day in-the-wild deployment, demonstrate gains in naturalness and immediacy over traditional VAs while revealing limitations in gaze data handling, explainability, and multi-referent queries. The work advances context-aware VA design for AR wearables and offers design implications and future directions for incorporating continuous gaze tracking, user autonomy, and richer multimodal explanations in real-world settings.
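
The query-rewriting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `resolve_query`, the deictic word list, the rule that a pointing label takes priority over a gaze label, and the "the <label>" substitution format are all assumptions made for illustration. The sketch only shows the general idea of replacing a pronoun with an explicit referent before the text is sent to an LLM.

```python
import re
from typing import Optional

# Hypothetical list of deictic words; the paper's actual vocabulary is not given here.
DEICTIC = re.compile(r"\b(this|that|these|those|it|here|there)\b", re.IGNORECASE)


def resolve_query(query: str,
                  gaze_label: Optional[str],
                  point_label: Optional[str]) -> str:
    """Replace the first deictic word in `query` with an explicit referent.

    Assumption: a pointing label, when present, takes priority over a gaze label.
    If neither label is available, the query is returned unchanged.
    """
    referent = point_label or gaze_label
    if referent is None:
        return query
    # A function replacement avoids backslash escapes in the label being interpreted.
    return DEICTIC.sub(lambda match: f"the {referent}", query, count=1)


if __name__ == "__main__":
    # The labels stand in for the output of an object recognizer (hypothetical).
    print(resolve_query("How long should I cook this?", "box of pasta", None))
```

In a fuller pipeline the rewritten string, together with the conversation history, would form the prompt passed to the LLM; that step is omitted here because it depends on the specific model API in use.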

Abstract

Voice assistants (VAs) like Siri and Alexa are transforming human-computer interaction; however, they lack awareness of users' spatiotemporal context, resulting in limited performance and unnatural dialogue. We introduce GazePointAR, a fully functional context-aware VA for wearable augmented reality that leverages eye gaze, pointing gestures, and conversation history to disambiguate speech queries. With GazePointAR, users can ask "what's over there?" or "how do I solve this math problem?" simply by looking and/or pointing. We evaluated GazePointAR in a three-part lab study (N=12): (1) comparing GazePointAR to two commercial systems; (2) examining GazePointAR's pronoun disambiguation across three tasks; and (3) an open-ended phase where participants could suggest and try their own context-sensitive queries. Participants appreciated the naturalness and human-like nature of pronoun-driven queries, although pronoun use was sometimes counter-intuitive. We then iterated on GazePointAR and conducted a first-person diary study examining how GazePointAR performs in the wild. We conclude by enumerating limitations and design considerations for future context-aware VAs.

Paper Structure

This paper contains 28 sections, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: System overview and implementation details of GazePointAR.
  • Figure 2: Cooking scenario and the three VAs used in Part 1 of the study.
  • Figure 3: Usage scenarios in Part 2 of the study.
  • Figure 4: Design probes in Part 3 of the study. See supplementary materials for the videos.
  • Figure 5: The mean and standard deviation of task time, usability, perceived intelligence, helpfulness, naturalness, and overall preference. Task time is in seconds. Usability is scored 0-100; higher is better. Rankings are 1-3; lower is better. For statistical significance, one asterisk (*) indicates $p < 0.05$; two asterisks (**) indicate $p < 0.01$.
  • ...and 4 more figures