GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Athul M. Mathew, Haithem Hermassi, Thariq Khalid, Arshad Ali Khan, Riad Souissi

TL;DR

GazeVLM introduces a unified vision–language model for multi-task gaze understanding, integrating person detection, gaze target localization, and gaze object identification within a single framework. By fusing RGB imagery with HHA-encoded depth through cross-attention in a frozen vision encoder (Qwen2-VL) and a text decoder, the model accepts natural-language prompts to selectively perform tasks. It converts dataset annotations into text prompts and depth-based, geometry-rich inputs, and introduces an object-level gaze metric $AP_{ob}$, enabling robust evaluation on static and dynamic gaze datasets. Experiments on GazeFollow and VideoAttentionTarget demonstrate state-of-the-art performance and the efficacy of depth-informed fusion, suggesting strong potential for robust, real-time gaze analytics in human–computer interaction and related domains.

Abstract

Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system is still needed for gaze understanding using both visual and language prompts. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. The model integrates visual (RGB and depth) and textual modalities, and our ablation study on visual input combinations shows that fusing RGB images with HHA-encoded depth maps, guided by text prompts, yields superior performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). In experiments, GazeVLM demonstrates significant improvements, notably achieving state-of-the-art evaluation scores on the GazeFollow and VideoAttentionTarget datasets.

Paper Structure

This paper contains 25 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A comparison of previous methods versus our approach.
  • Figure 2: Dataset format example. Each task statement is marked with special tokens <im_start> and <im_end>. Image features are separated from text features using special tokens <vision_start> and <vision_end>, in line with the ChatML format [openai_chatml].
  • Figure 3: Overview of GazeVLM. The model processes an input comprising an RGB image and a corresponding HHA-encoded depth map. Our model is multi-task and can detect individuals, their gaze points, and the objects they focus on, based on the task provided by the user input prompt.
  • Figure 4: Qualitative results from the GazeFollow and VideoAttentionTarget datasets. Each column shows an example image with a series of multi-turn user input prompts and model responses. The user input prompt and the model response for each image are marked row-wise with their respective icons. The example in the fourth column also demonstrates a scenario for gaze in/out classification. The model responds with a textual tag "looking out of the image" if the person's gaze is not within the field of view of the scene image. All model responses are spatially located and color-coded for easier visualization.
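
The token layout described in the Figure 2 caption can be illustrated with a minimal sketch. Only the four special tokens come from the caption. The role names, the image placeholder, the prompt wording, the response shape, and the choice to attach the image to the first turn only are assumptions made for illustration, not the authors' actual data pipeline.

```python
# Special tokens as named in the Figure 2 caption.
IM_START, IM_END = "<im_start>", "<im_end>"
VISION_START, VISION_END = "<vision_start>", "<vision_end>"

# Hypothetical stand-in for the image features; not taken from the paper.
IMAGE_PLACEHOLDER = "<image_features>"


def format_turn(role: str, text: str, with_image: bool = False) -> str:
    """Wrap one statement in start/end tokens, optionally preceded by an image slot."""
    vision = f"{VISION_START}{IMAGE_PLACEHOLDER}{VISION_END}" if with_image else ""
    return f"{IM_START}{role}\n{vision}{text}{IM_END}"


def format_conversation(turns: list[tuple[str, str]]) -> str:
    """Join multi-turn (role, text) pairs; here only the first turn carries the image."""
    return "\n".join(
        format_turn(role, text, with_image=(i == 0))
        for i, (role, text) in enumerate(turns)
    )


example = format_conversation([
    ("user", "Detect the people in the image."),   # illustrative prompt wording
    ("assistant", "person [x1, y1, x2, y2]"),      # illustrative response shape
])
print(example)
```

Keeping the image slot in a single turn is one plausible way to support the multi-turn exchanges shown in Figure 4, where several task prompts refer to the same image.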