G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios

Zeyu Wang; Yuanchun Shi; Yuntao Wang; Yuchen Yao; Kun Yan; Yuhan Wang; Lei Ji; Xuhai Xu; Chun Yu

TL;DR

G-VOILA introduces a gaze-facilitated information querying paradigm that combines gaze data, the user's visual field, and voice-based natural language queries to support in-situ information retrieval in daily life. Through a user-enactment study (n=21) and a design framework, the work shows how gaze patterns disambiguate spoken queries and provide contextual grounding, including a gaze-voice coordination pattern during query formulation. A proof-of-concept G-VOILA system shows that incorporating gaze yields higher objective and subjective scores than a baseline without gaze data. The work offers a design framework for integrating gaze into multimodal information querying and discusses practical considerations for deployment and for input modalities beyond gaze and voice.

Abstract

Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we identified ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. We then implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness: it achieved higher objective and subjective scores than a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
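To make the idea of gaze grounding an ambiguous spoken query more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that objects in the visual field are already available as labeled bounding boxes and that gaze samples are 2D points in the same image coordinates. The names `Box`, `gazed_object`, and `ground_query` are hypothetical, and the majority-vote rule is a deliberately simple stand-in for the deep learning pipeline the paper mentions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A labeled object region in the visual field (hypothetical structure)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def gazed_object(boxes, gaze_points):
    """Return the label of the box that contains the most gaze samples, or None."""
    counts = {}
    for x, y in gaze_points:
        for box in boxes:
            if box.contains(x, y):
                counts[box.label] = counts.get(box.label, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def ground_query(query, boxes, gaze_points):
    """Attach the gazed object to the spoken query as extra context."""
    target = gazed_object(boxes, gaze_points)
    if target is None:
        # No gaze evidence: fall back to the query as spoken.
        return query
    return f"{query} [gaze target: {target}]"


# Usage: the pronoun "it" is ambiguous until gaze points at the conical hat.
boxes = [Box("conical hat", 0, 0, 10, 10), Box("basket", 20, 0, 30, 10)]
gaze = [(5, 5), (6, 4), (25, 5)]
grounded = ground_query("What is it made of?", boxes, gaze)
```

In this toy setup the grounded query carries the gazed object as context that a downstream answering model could use. A practical system would also need to handle gaze noise and the timing between gaze and speech, which this sketch ignores.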

Paper Structure (66 sections, 4 equations, 13 figures, 5 tables)

Introduction
Related Works
Information Retrieval for Daily Scenarios
Contextual Computing for Interaction
Gaze-based Intention Reasoning
In-situ Voice-based Interaction
Multi-Modal Large Models
Integrated Systems Approach
Unified Vision-Language Model Approaches
Multimodal Integration for Alignment
The User-enactment study
Introducing G-VOILA: A Gaze-Facilitated Querying Paradigm
The User-Enactment Study
Scenarios and Participants.
Hardware Setup.
...and 51 more sections

Figures (13)

Figure 1: Illustrating G-VOILA use cases in a shopping scenario. A user wearing smart glasses is shopping for dinner. She reaches the vegetable aisle and desires detailed information about specific vegetables. Here is how an assistant within the G-VOILA paradigm might assist her: (1) Query Initiation: She poses queries through natural voice commands. (2) Contextual Analysis: The assistant analyzes the user's field of view and gaze to discern specific areas of interest. (3) Query Response: By further aligning the situational context with the posed questions, the assistant deduces her precise query intent and delivers a clear response.
Figure 2: Discussion chart for the user-enactment study. Headers and four illustrative examples are shown in the chart. In the first example, the participant assumed G-VOILA could understand "it" to be a conical hat because they were looking at it, but nothing indicated that they also wanted to know the manufacturing process, so G-VOILA should receive a lower matchness score. When querying with Google, the platform they use daily, they might rephrase the question to make it clearer.
Figure 3: Statistical results for the formative study, shown for all scenarios. (a) A bar plot of the query count for each category. All queries are classified by their ambiguity; the taxonomy is presented in Section \ref{sec:language_pattern} and Figure \ref{fig:exp1_category}. (b) The hardness of seeking solutions and the answer's matchness to the user's query intent, under the user-envisaged G-VOILA and text-based IR approaches.
Figure 4: Breakdown of user queries by their anticipation of G-VOILA usage. "Lessons learned" is short for "Lessons learned from other well-designed system". At least one query example is provided for each category.
Figure 5: Exploring Inquiry Form-Factors with G-VOILA. The identified categories are shown in each card, with a checkmark indicating the data required to fill in the user's query intent. Keywords in the query text are highlighted, and representative frames are selected from each query video.
...and 8 more figures
