Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

Yiwei Li; Zihao Wu; Yanjun Lv; Hanqi Jiang; Weihang You; Zhengliang Liu; Dajiang Zhu; Xiang Li; Quanzheng Li; Tianming Liu; Lin Zhao

TL;DR

Eye-gaze supervises VLM reasoning through a small set of dedicated gaze tokens that are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration.

Abstract

Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.
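To make "gaze-selected image patch indices in temporal order" concrete, the following is a minimal sketch of one way such targets could be derived from a fixation sequence. It is not the paper's preprocessing: the function name `gaze_to_patch_indices`, the patch grid size, the equal-count split into temporal steps, and the choice of the patch with the longest dwell time per step are all assumptions made for illustration. Only the use of four temporal steps mirrors the four gaze tokens mentioned in Figure 1.

```python
from collections import defaultdict


def gaze_to_patch_indices(fixations, image_size, grid=(16, 16), num_steps=4):
    """Map time-ordered fixations to one patch index per temporal step.

    fixations:  list of (x, y, duration) tuples, sorted by onset time,
                with x and y in pixel coordinates.
    image_size: (width, height) of the image in pixels.
    grid:       (cols, rows) of the patch grid (hypothetical default).
    Returns a list of `num_steps` row-major patch indices; a step with no
    fixations yields None.
    """
    width, height = image_size
    cols, rows = grid
    n = len(fixations)
    targets = []
    for k in range(num_steps):
        # Assumption: split the sequence into contiguous, equal-count chunks.
        chunk = fixations[k * n // num_steps:(k + 1) * n // num_steps]
        dwell = defaultdict(float)
        for x, y, duration in chunk:
            # Clamp so that points on the image border stay inside the grid.
            col = min(max(int(x * cols / width), 0), cols - 1)
            row = min(max(int(y * rows / height), 0), rows - 1)
            dwell[row * cols + col] += duration
        # Assumption: the step's target is the patch with the longest total
        # dwell time; ties go to the lowest patch index.
        targets.append(max(sorted(dwell), key=dwell.get) if dwell else None)
    return targets


example_fixations = [
    (10, 10, 0.2), (20, 15, 0.3),      # top-left patch
    (200, 30, 0.5), (210, 40, 0.1),    # top-right patch
    (100, 120, 0.4), (110, 130, 0.2),  # patch in the third row
    (30, 200, 0.6), (40, 210, 0.1),    # bottom-left patch
]
example_targets = gaze_to_patch_indices(example_fixations, (224, 224), grid=(4, 4))
```

With a 224 x 224 image and a 4 x 4 grid, `example_targets` is `[0, 3, 9, 12]`: one patch ID per step, in the order the regions were visited. Preserving that order is what distinguishes a trajectory target from a static heatmap.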

Paper Structure (22 sections, 7 equations, 2 figures, 2 tables)

Introduction
Related Work
Thinking with Images and Latent Visual Reasoning
Eye-gaze Supervision in Radiology
Method
Problem Formulation
Dataset and Multi-modal Preprocessing (MIMIC-EYE)
Audio-text-gaze temporal alignment.
From gaze to patch indices.
Model Architecture
Backbone VLM.
Fixed-format generation with latent gaze tokens.
Gaze projection head.
14-label classifier head.
Two-stage Training Objective
...and 7 more sections

Figures (2)

Figure 1: Method overview. We fine-tune a pretrained VLM with MIMIC-EYE by injecting gaze supervision as discrete patch indices. Stage 1 learns four dedicated gaze tokens via a lightweight projection head that predicts gaze-selected patch IDs (cross-entropy). Stage 2 adds a 14-label classifier head to predict radiographic findings (binary cross-entropy) while enforcing a strict fixed-format yes/no output.
Figure 2: Eye-gaze reasoning trajectories. Two MIMIC-EYE examples showing temporally ordered gaze heatmaps overlaid on the chest X-ray. Each sequence visualizes how attention evolves from Step 1 to Step 4, illustrating radiologists' sequential evidence acquisition during interpretation.
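The two loss types named in the Figure 1 caption can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the averaging over tokens and labels, and the logit shapes are assumptions, and the captions do not say how the two terms are weighted or whether the gaze loss is retained in Stage 2, so the sketch keeps them separate.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one gaze token over all patch IDs."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def gaze_loss(token_logits, patch_targets):
    """Stage 1 style loss: mean cross-entropy over the gaze tokens.

    token_logits:  one list of patch logits per gaze token.
    patch_targets: one gaze-selected patch index per gaze token.
    """
    losses = [cross_entropy(l, t) for l, t in zip(token_logits, patch_targets)]
    return sum(losses) / len(losses)


def findings_loss(label_logits, labels):
    """Stage 2 style loss: mean binary cross-entropy over the labels.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    terms = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(label_logits, labels)
    ]
    return sum(terms) / len(terms)


# Four gaze tokens with uniform logits over 16 patches (hypothetical sizes).
stage1 = gaze_loss([[0.0] * 16 for _ in range(4)], [0, 3, 9, 12])
# Fourteen findings with zero logits, i.e. a predicted probability of 0.5 each.
stage2 = findings_loss([0.0] * 14, [1, 0] * 7)
```

Under uniform logits the gaze loss equals log(16), the chance-level value for 16 patches, and the findings loss equals log(2). These chance-level values are useful reference points when checking that training has started to learn anything.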
