G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios
Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, Yuhan Wang, Lei Ji, Xuhai Xu, Chun Yu
TL;DR
G-VOILA introduces a gaze-facilitated information querying paradigm that combines gaze data, the user's visual field, and voice-based natural language queries to support in-situ information retrieval in daily life. Through a user-enactment study (n=21) and a design framework, the work shows how gaze patterns disambiguate spoken queries and provide contextual grounding, including a gaze-voice coordination pattern during query formulation. A proof-of-concept G-VOILA system achieves higher objective and subjective scores than a baseline without gaze data in a follow-up user study. The study offers a design framework for integrating gaze into multimodal information querying and, drawing on user interviews, provides insights for future gaze-facilitated information querying systems.
Abstract
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze, a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables, remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we identified ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. We then implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness: it achieves higher objective and subjective scores than a baseline without gaze data. We further conducted interviews and provide insights for future gaze-facilitated information querying systems.
