
Decoding Reading Goals from Eye Movements

Omer Shubi, Cfir Avraham Hadar, Yevgeni Berzak

TL;DR

This work investigates whether reading goals can be decoded from eye movements by distinguishing information seeking from ordinary reading. It empirically evaluates a broad suite of models, including transformer-based architectures that fuse scanpath data with text, and introduces a logistic ensemble that combines model predictions. The results show that fixation-level, text-aware transformers yield the best single-model performance, that online predictions are feasible before a reader finishes a passage, and that ensembles provide additional gains. A new mixed-effects analysis interprets model errors and identifies the textual and reader factors that drive task difficulty. This advances the understanding of variability in eye-movement patterns across reading regimes and informs practical applications in education and assistive technologies.
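The TL;DR mentions a logistic ensemble that combines model predictions. A minimal sketch, assuming the ensemble is a logistic regression (stacking) over the logit-transformed probabilities that each base model assigns to "information seeking", fitted on held-out predictions. The function names, the logit transform, and the plain gradient-descent fit are illustrative choices, not the authors' implementation.

```python
import math


def logit(p, eps=1e-6):
    # Clip to avoid log(0) when a base model is fully confident.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_ensemble(base_probs, labels, lr=0.1, epochs=2000):
    """Fit one weight per base model plus a bias (hypothetical stacking setup).

    base_probs: one row per trial; each row holds P(information seeking)
                from every base model.
    labels:     1 = information seeking, 0 = ordinary reading.
    Uses batch gradient descent on the mean cross-entropy loss.
    """
    n_models = len(base_probs[0])
    X = [[logit(p) for p in row] for row in base_probs]
    n = len(X)
    w = [0.0] * n_models
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for x, y in zip(X, labels):
            # Gradient of the cross-entropy loss w.r.t. the linear score.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_models):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_ensemble(w, b, probs):
    # Combined probability that a trial is information seeking.
    x = [logit(p) for p in probs]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Learning weights on held-out predictions, rather than averaging, lets the ensemble down-weight base models whose outputs add little once the stronger models are known.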

Abstract

Readers can have different goals with respect to the text that they are reading. Can these goals be decoded from their eye movements over the text? In this work, we examine for the first time whether it is possible to distinguish between two common types of reading goals: information seeking and ordinary reading for comprehension. Using large-scale eye-tracking data, we address this task with a wide range of models that cover different architectural and data representation strategies, and further introduce a new model ensemble. We find that transformer-based models with scanpath representations coupled with language modeling solve it most successfully, and that accurate predictions can be made in real time, long before the participant has finished reading the text. We further introduce a new method for model performance analysis based on mixed-effects modeling. Combining this method with rich textual annotations reveals key properties of textual items and participants that contribute to the difficulty of the task, and improves our understanding of the variability in eye movement patterns across the two reading regimes.
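The abstract states that accurate predictions can be made before the participant has finished reading. One way to pose this online setting is to feed a classifier only the part of the scanpath recorded so far. A minimal sketch, assuming fixations are given as 0-based word indices and that a prefix ends when the reader first fixates a word beyond a chosen fraction of the passage; this truncation criterion and the function name are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def scanpath_prefix(fixation_word_ids: Sequence[int], n_words: int, fraction: float) -> List[int]:
    """Return the fixations made before the reader first reaches the word at
    position fraction * n_words (hypothetical online-truncation rule).

    Regressions to earlier words inside the prefix are kept, so the prefix
    preserves the reading behaviour observed up to that point.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    cutoff = fraction * n_words  # first word index that ends the prefix
    prefix = []
    for word_id in fixation_word_ids:
        if word_id >= cutoff:
            break
        prefix.append(word_id)
    return prefix
```

For example, with fixations `[0, 1, 2, 1, 3, 4, 6, 5, 7, 8, 9]` on a 10-word passage and `fraction=0.5`, the prefix is `[0, 1, 2, 1, 3, 4]`; a trained classifier would then be applied to such prefixes to obtain predictions partway through the passage.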

Paper Structure

This paper contains 25 sections, 2 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Proposed task: decoding whether a reader is seeking specific information or reading for general comprehension, given their eye movements over a single passage. In the eye movements image, circles represent fixations, and lines represent saccades. Bounding boxes mark word Interest Areas (fixations within the box are assigned to the respective word).
  • Figure 2: A schematic depiction of one of the 10 splits into train, validation, and the three test sets for one batch of 10 OneStopQA articles and 120 participants. Dashed lines denote information seeking trials. The full data split consists of the union of three such splits.
  • Figure 3: Coefficients from a mixed-effects model that predicts, from properties of a trial, whether RoBERTa-Eye-F's prediction for that trial is correct. CS stands for the critical span, the portion of the paragraph that contains the information essential for answering the question correctly. Separate models are fitted for ordinary reading and information seeking trials. Predictors are z-normalized. Significance of the depicted coefficients is assessed after a 10x Bonferroni correction, to mitigate the risk of false positives when testing multiple hypotheses simultaneously. '*' $p<0.05$, '**' $p<0.01$, '***' $p<0.001$.
  • Figure 4: An example of a scanpath rendered as an image, as used by the image classification models.
  • Figure 5: Visualization of the different model architectures (Part 1). $P$ represents the paragraph and $E^P_S$ the eye movements of participant $S$ on $P$. $LM$ stands for a language model and $FC$ for fully connected layers. $FF_{f_i}$ and $w_{f_i}$ denote, respectively, the features of the $i$-th fixation and the word corresponding to that fixation.
  • ...and 4 more figures
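The Figure 3 caption mentions z-normalized predictors, a 10x Bonferroni correction, and star notation for significance levels. A minimal sketch of these three steps, assuming "10x" means each p-value is multiplied by 10 and capped at 1 (equivalent to dividing the 0.05 threshold by 10), and using the population standard deviation for z-scores; both choices and all function names are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_normalize(values: Sequence[float]) -> List[float]:
    # Center to mean 0 and scale to unit (population) standard deviation,
    # so coefficients of different predictors are on a comparable scale.
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot z-normalize a constant predictor")
    return [(v - mu) / sd for v in values]


def bonferroni(p_values: Sequence[float], n_tests: int = 10) -> List[float]:
    # Multiply each p-value by the number of tests, capping at 1.
    return [min(1.0, p * n_tests) for p in p_values]


def significance_stars(p: float) -> str:
    # Star convention from the caption.
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

For instance, a raw p-value of 0.004 becomes 0.04 after correction and is marked '*', while a raw p-value of 0.2 is capped at 1.0 and left unmarked.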