Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

TL;DR

This study investigates whether Vision Language Models can infer human gaze direction, a core component of Theory of Mind, using a tightly controlled gaze-referent task evaluated across 111 VLMs and 65 humans. By constraining stimuli to isolate core gaze inference and employing a four-option decision framework, the authors find that the majority of VLMs perform at chance ($0.42$ baseline) while humans attain near-ceiling accuracy; a small group of top-tier VLMs exceed chance but remain well below human performance. The results suggest that these models primarily rely on head orientation rather than eye appearance for gaze inference, with error patterns that differ markedly from humans, indicating qualitatively different inference mechanisms. Overall, the work highlights current limits of gaze understanding in VLMs for natural human–AI interactions and outlines directions for developing gaze-direction-aware agents in the future.

Abstract

The ability to infer what others are looking at is a critical component of a theory of mind that underpins natural human-AI interaction. We characterized this skill in 111 Vision Language Models (VLMs) and human participants (N = 65) using photos taken with manipulated difficulty and variability. We found that 94 of the 111 VLMs were not better than random guessing, while humans achieved near-ceiling accuracy. VLMs respond with each choice almost equally frequently. Are they randomly guessing? At least for five top-tier VLMs, their performance was above chance, declined with increasing task difficulty, but barely varied across different prompts and scene objects. These behavioral patterns cannot be explained by considering VLMs as random guessers. Instead, they likely utilize head orientation but not eye appearance to infer gaze direction, such that their performance is imperfect, subject to the task difficulty, but robust to superficial perceptual variations. This suggests that VLMs, lacking effective gaze inference skills, have yet to become technologies that can naturally interact with humans, but the potential remains.
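As an illustration of what "not better than random guessing" could mean operationally, the following minimal sketch runs a one-sided exact binomial test against the 42% chance level. The chance level and the count of 900 stimuli come from the source; the function names, the significance threshold, the example counts, and the assumption that a model answers each stimulus exactly once are hypothetical, and the paper's own statistical procedure may differ.

```python
from math import exp, lgamma, log

CHANCE = 0.42    # random-guessing accuracy reported in the source
N_TRIALS = 900   # number of test stimuli reported in the source (assumed: one answer each)


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p), with 0 < p < 1."""
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for k in range(max(successes, 0), trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
                   + k * log_p + (trials - k) * log_q)
        total += exp(log_pmf)
    return min(total, 1.0)


def beats_chance(correct: int, trials: int = N_TRIALS,
                 chance: float = CHANCE, alpha: float = 0.001) -> bool:
    """True if the observed number of correct answers is significantly above chance.

    alpha is a hypothetical threshold, not a value taken from the paper.
    """
    return binomial_tail(correct, trials, chance) < alpha


if __name__ == "__main__":
    # Hypothetical counts of correct answers, for illustration only.
    for correct in (378, 400, 540):
        print(correct, correct / N_TRIALS, beats_chance(correct))
```

Under these assumptions, a model with exactly 42% accuracy (378 of 900) is indistinguishable from a random guesser, while one with 60% accuracy (540 of 900) clearly exceeds chance.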

Paper Structure

This paper contains 47 sections, 1 equation, 21 figures, 4 tables.

Figures (21)

  • Figure 1: (a) The gaze referential inference task (arrow added for illustration). (b) 99.9% confidence intervals of the accuracy means are depicted. A random-guessing machine achieves an accuracy of around 42%. A performance gap exists between top-tier Vision Language Models and humans.
  • Figure 2: Systematic manipulation of View (left/right/front), Proximity (1-3 scale), #Objects (2-4), Objects (18 combinations of 9 distinct items), and Gazer (2 actresses) across 900 test stimuli. Stimuli in subfigure (c) have a Proximity value of 2. Here Gazer=ActressX.
  • Figure 3: Each row of a confusion matrix gives, among trials in which the row object is the correct answer (e.g., the doll for rows with a doll emoji), the proportion of responses across all 111 VLMs (or all human participants) that chose the column object (e.g., the coffee for columns with a coffee emoji). Overall, humans occasionally choose an object adjacent to the correct one (the object the gazer looks at), while VLMs show a slight tendency towards certain items (e.g., the doll) combined with near-uniform sampling across the other options (in other words, probability matching to their priors).
  • Figure 4: No strong linear relation between VLM accuracy and release date was found.
  • Figure 5: The 95% confidence intervals for linear regression are drawn as shaded areas. Standard deviations are reported for variables drawn as horizontal lines.
  • ...and 16 more figures
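
The row-normalized confusion matrix described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(correct, chosen)` trial format, and the example trials are all hypothetical.

```python
from collections import Counter, defaultdict


def row_normalized_confusion(trials):
    """Build a row-normalized confusion matrix.

    trials: iterable of (correct_object, chosen_object) pairs (hypothetical format).
    Returns {correct_object: {chosen_object: proportion}}, where each row sums to 1.
    """
    counts = defaultdict(Counter)
    for correct, chosen in trials:
        counts[correct][chosen] += 1

    matrix = {}
    for correct, row in counts.items():
        total = sum(row.values())
        matrix[correct] = {chosen: n / total for chosen, n in row.items()}
    return matrix


if __name__ == "__main__":
    # Made-up responses pooled across respondents, for illustration only.
    example_trials = [
        ("doll", "doll"),
        ("doll", "coffee"),
        ("coffee", "coffee"),
        ("coffee", "coffee"),
    ]
    for correct, row in row_normalized_confusion(example_trials).items():
        print(correct, row)
```

Each row is normalized by the number of trials with that correct answer, so a diagonal entry is the accuracy for that object and the off-diagonal entries show where the errors go.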