In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng, Jiacheng Hua, Miao Liu, Feng Lu

TL;DR

This work introduces EgoGazeVQA, the first egocentric video QA benchmark that leverages user gaze to infer intent. The benchmark is built with a three-stage pipeline: extracting frame captions and gaze data, generating QA pairs with an MLLM, and validating them with human annotators. Its videos are drawn from the Ego4D, EgoExo4D, and EGTEA Gaze+ datasets. The study proposes three gaze-guided prompting strategies (GazeT, GazeV, GazeS) and demonstrates that gaze cues improve spatial, temporal, and causal reasoning in MLLMs, with larger models deriving the most benefit. A cross-dataset LoRA fine-tuning analysis shows that limited gaze-conditioned data can significantly boost performance, especially for spatial reasoning, and that these gains depend on accurate gaze estimation. Overall, EgoGazeVQA advances proactive, personalized egocentric AI by grounding MLLMs in explicit gaze signals and opens avenues for gaze-aware model interpretability and multi-sensory alignment.

Abstract

The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate system, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings. Project page: https://taiyi98.github.io/projects/EgoGazeVQA

Paper Structure

This paper contains 29 sections, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: We propose EgoGazeVQA, the first MLLM benchmark that incorporates essential gaze signals for understanding user intent in egocentric settings. We present examples of Spatial, Temporal, and Causal Intent QA, demonstrating how gaze information improves MLLMs' performance. Correct and incorrect predictions by MLLMs with and without gaze cues are highlighted. Radar charts compare model performance across different scenarios and activities, showing consistent gains with our gaze-guided prompting strategy.
  • Figure 2: Construction pipeline of EgoGazeVQA. We build EgoGazeVQA in three stages. Stage 1: Egocentric video clips are processed to extract frame captions and gaze coordinates that capture user focus. Stage 2: An MLLM generates spatial-aware, temporal-aware, and intention-related Q&A pairs using a customized prompt. Stage 3: Human annotators manually review the generated Q&A pairs along several quality dimensions to ensure high-quality data.
  • Figure 3: Gaze-guided prompting strategies in EgoGazeVQA. We experiment with three gaze-guided prompting strategies on the EgoGazeVQA benchmark: Gaze as Textual Prompt (left), where gaze coordinates are presented as text inputs to guide model responses; Gaze as Visual Prompt (center), which highlights gaze points directly on video frames to inform answer selection; and Gaze Salience Maps as Prompt (right), which uses heatmaps of gaze trajectories to provide contextual cues for understanding spatial and temporal intent. We evaluate how well each strategy enhances the MLLM's ability to interpret user focus and intent in Table \ref{tab:expr_modal}.
  • Figure 4: Visual examples from our EgoGazeVQA benchmark and gaze-guided prompting
  • Figure 5: We present examples from our EgoGazeVQA benchmark, illustrating how gaze information influences model predictions. Each example includes the question, multiple-choice options, and the ground truth, with correct and incorrect predictions highlighted for prevailing MLLMs with and without gaze-guided prompting.
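
To make two of the prompting strategies described in Figure 3 more concrete, the following is a minimal Python sketch of gaze as a textual prompt and gaze as a salience map. It is not the authors' implementation. The function names (`gaze_text_prompt`, `gaze_salience_map`), the prompt wording, the normalized top-left coordinate convention, and the Gaussian width `sigma` are all assumptions made for illustration; the summary above does not specify them. The visual-prompt variant, which draws gaze points on frames, is omitted because it would need an image library.

```python
import math
from typing import List, Sequence, Tuple

# Assumed convention: gaze is a normalized (x, y) pair in [0, 1], origin at top-left.
Gaze = Tuple[float, float]


def gaze_text_prompt(question: str, options: Sequence[str], gaze: Sequence[Gaze]) -> str:
    """Build a multiple-choice prompt that lists one gaze coordinate per frame as text.

    The wording is hypothetical; the benchmark's actual prompt may differ.
    """
    gaze_lines = [
        f"Frame {i}: gaze at (x={x:.2f}, y={y:.2f})" for i, (x, y) in enumerate(gaze)
    ]
    choices = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(
        ["Gaze coordinates are normalized to [0, 1], origin at top-left."]
        + gaze_lines
        + [f"Question: {question}"]
        + choices
        + ["Answer with a single letter."]
    )


def gaze_salience_map(
    gaze: Sequence[Gaze], width: int = 32, height: int = 32, sigma: float = 2.0
) -> List[List[float]]:
    """Sum an isotropic Gaussian around each gaze point and scale the peak to 1.

    Returns a height x width grid; an empty gaze sequence yields an all-zero grid.
    """
    heat = [[0.0] * width for _ in range(height)]
    if not gaze:
        return heat
    for gx, gy in gaze:
        # Map normalized coordinates to grid coordinates.
        cx, cy = gx * (width - 1), gy * (height - 1)
        for r in range(height):
            for c in range(width):
                d2 = (c - cx) ** 2 + (r - cy) ** 2
                heat[r][c] += math.exp(-d2 / (2 * sigma ** 2))
    peak = max(max(row) for row in heat)
    return [[v / peak for v in row] for row in heat]
```

In this sketch the text prompt can be sent to any MLLM alongside the sampled frames, while the salience grid would still need to be resized and overlaid on a frame before being used as an image input.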