Fine-Grained Prediction of Reading Comprehension from Eye Movements

Omer Shubi, Yoav Meiri, Cfir Avraham Hadar, Yevgeni Berzak

TL;DR

This work investigates whether eye movements can enable fine-grained prediction of reading comprehension at the level of a single question over a paragraph. It introduces the large-scale OneStop Eye Movements dataset and three multimodal transformer models (RoBERTa-QEye, MAG-QEye, PostFusion-QEye) that fuse gaze data with text, evaluated under ordinary reading and information-seeking regimes. Across extensive cross-validation with strict generalization tests, eye movements provide informative but modest gains over strong text-only baselines, with gains varying by regime and task. The findings highlight both the potential and limits of gaze-based comprehension assessment, emphasizing the need for more data, robust multimodal architectures, and careful baseline comparisons for reliable deployment.

Abstract

Can human reading comprehension be assessed from eye movements in reading? In this work, we address this longstanding question using large-scale eyetracking data over textual materials that are geared towards behavioral analyses of reading comprehension. We focus on a fine-grained and largely unaddressed task of predicting reading comprehension from eye movements at the level of a single question over a passage. We tackle this task using three new multimodal language models, as well as a battery of prior models from the literature. We evaluate the models' ability to generalize to new textual items, new participants, and the combination of both, in two different reading regimes, ordinary reading and information seeking. The evaluations suggest that although the task is highly challenging, eye movements contain useful signals for fine-grained prediction of reading comprehension. Code and data will be made publicly available.

Paper Structure (36 sections, 8 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Left: an example of an eye movement trajectory over a paragraph, where red circles represent fixations and blue arrows represent saccades. Right: a schematic depiction of word-level feature extraction, resulting in a vector $E_{w_i}$ for each word: a feature representation that combines the word's eye movement measures with its linguistic properties.
  • Figure 2: Model architectures. (a) RoBERTa-QEye treats eye movements as additional input features. (b) MAG-QEye uses eye movement information to modify contextualized word representations. (c) PostFusion-QEye processes text and eye movements separately and combines them via cross-attention. Model input: $Eyes^P$ represents the participant's eye movements over the paragraph $p$, $q^p$ is a question, and $[Ans^{q^p}]$ are optional answer choices, provided only in the multiple-choice version of the task.
  • Figure 3: A schematic depiction of a 10-article 60-participant batch split, divided into a train set, a validation set, and the three test sets. A full data split for a reading regime (ordinary reading or information seeking) consists of the union of three batch splits.
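The three test sets in Figure 3 correspond to the generalization settings named in the abstract: new textual items, new participants, and both at once. The sketch below illustrates one way such a partition could be built for a single 10-article, 60-participant batch. It is not the authors' code: the function name `split_batch`, the hold-out sizes (2 articles, 12 participants), and the assumption that every participant reads every article in the batch are all hypothetical. The captions do not say how the validation set is carved out, so it is left out here.

```python
from itertools import product


def split_batch(articles, participants, n_test_articles, n_test_participants):
    """Partition (participant, article) trials of one batch.

    Hypothetical illustration: the last `n_test_articles` articles and the
    last `n_test_participants` participants are held out. A trial goes to
    the train set only if both its article and its participant are seen.
    """
    held_articles = set(articles[len(articles) - n_test_articles:])
    held_participants = set(participants[len(participants) - n_test_participants:])

    splits = {
        "train": [],
        "new_item": [],
        "new_participant": [],
        "new_item_and_participant": [],
    }
    for participant, article in product(participants, articles):
        is_new_item = article in held_articles
        is_new_participant = participant in held_participants
        if is_new_item and is_new_participant:
            key = "new_item_and_participant"
        elif is_new_item:
            key = "new_item"
        elif is_new_participant:
            key = "new_participant"
        else:
            key = "train"
        splits[key].append((participant, article))
    return splits


# Batch dimensions follow the Figure 3 caption; hold-out sizes are assumed.
articles = [f"article_{i}" for i in range(10)]
participants = [f"participant_{i}" for i in range(60)]
splits = split_batch(articles, participants, n_test_articles=2, n_test_participants=12)

for name, trials in splits.items():
    print(name, len(trials))
```

Under these assumptions no held-out article or participant ever appears in the train set, which is the property that makes the three test sets a check of generalization rather than memorization. Per the Figure 3 caption, a full split for one reading regime would be the union of three such batch splits.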