GazeVLM: A Vision-Language Model for Multi-Task Gaze Understanding

Athul M. Mathew; Haithem Hermassi; Thariq Khalid; Arshad Ali Khan; Riad Souissi

TL;DR

GazeVLM introduces a unified vision-language model for multi-task gaze understanding, integrating person detection, gaze target localization, and gaze object identification within a single framework. The model fuses RGB imagery with HHA-encoded depth through cross-attention, using a frozen vision encoder (Qwen2-VL) and a text decoder, and accepts natural-language prompts that select which task to perform. It converts dataset annotations into text prompts and depth-based, geometry-rich inputs, and introduces an object-level gaze metric, $AP_{ob}$, for evaluation on static and dynamic gaze datasets. Experiments on GazeFollow and VideoAttentionTarget demonstrate state-of-the-art performance and the efficacy of depth-informed fusion, suggesting potential for robust, real-time gaze analytics in human-computer interaction and related domains.

Abstract

Gaze understanding unifies the detection of people, their gaze targets, and objects of interest into a single framework, offering critical insight into visual attention and intent estimation. Although prior research has modelled gaze cues in visual scenes, a unified system for gaze understanding that uses both visual and language prompts is still needed. This paper introduces GazeVLM, a novel Vision-Language Model (VLM) for multi-task gaze understanding in images, addressing person detection, gaze target detection, and gaze object identification. While other transformer-based methods exist for gaze analysis, GazeVLM represents, to our knowledge, the first application of a VLM to these combined tasks, allowing for selective execution of each task. GazeVLM integrates visual (RGB and depth) and textual modalities; our ablation study on visual input combinations shows that fusing RGB images with HHA-encoded depth maps, guided by text prompts, yields the best performance. We also introduce an object-level gaze detection metric for gaze object identification ($AP_{ob}$). In experiments, GazeVLM achieves state-of-the-art evaluation scores on the GazeFollow and VideoAttentionTarget datasets.
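
To make the idea of prompt-driven, selective task execution concrete, the following is a minimal sketch of how a single gaze annotation could be turned into task-specific prompt and target strings. It is an illustration only: the template wording, the `GazeAnnotation` fields, and the helper names `build_prompt` and `build_target` are hypothetical and are not taken from the paper, whose actual prompt format is not described here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x_min, y_min, x_max, y_max), assumed normalized to [0, 1].
Box = Tuple[float, float, float, float]

# Hypothetical prompt templates, one per task named in the abstract.
# The real GazeVLM prompt wording is an assumption here.
PROMPTS = {
    "person_detection": "Detect all people in the image.",
    "gaze_target": "Where is the person at {box} looking?",
    "gaze_object": "Which object is the person at {box} looking at?",
}


@dataclass
class GazeAnnotation:
    """One annotated person (hypothetical schema for illustration)."""
    person_box: Box
    gaze_point: Optional[Tuple[float, float]] = None
    object_label: Optional[str] = None


def format_box(box: Box) -> str:
    """Render a box as text with two decimals per coordinate."""
    return "[" + ", ".join(f"{v:.2f}" for v in box) + "]"


def build_prompt(task: str, ann: Optional[GazeAnnotation] = None) -> str:
    """Select the task through the prompt text alone."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task: {task}")
    template = PROMPTS[task]
    if "{box}" in template:
        if ann is None:
            raise ValueError(f"task {task!r} needs an annotated person box")
        return template.format(box=format_box(ann.person_box))
    return template


def build_target(task: str, ann: GazeAnnotation) -> str:
    """Render the expected text answer for one task from the annotation."""
    if task == "person_detection":
        return format_box(ann.person_box)
    if task == "gaze_target":
        if ann.gaze_point is None:
            raise ValueError("annotation has no gaze point")
        x, y = ann.gaze_point
        return f"({x:.2f}, {y:.2f})"
    if task == "gaze_object":
        if ann.object_label is None:
            raise ValueError("annotation has no object label")
        return ann.object_label
    raise ValueError(f"unknown task: {task}")
```

Under these assumptions, the same annotation yields a different prompt and target pair per task, so one model could in principle be asked for only the output that is needed, for example `build_prompt("gaze_object", ann)` paired with `build_target("gaze_object", ann)`.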
