EMTeC: A Corpus of Eye Movements on Machine-Generated Texts
Lena Sophia Bolliger, Patrick Haller, Isabelle Caroline Rose Cretton, David Robert Reich, Tannon Kew, Lena Ann Jäger
TL;DR
EMTeC is an eye-tracking corpus collected while 107 native English speakers read machine-generated texts produced by three LLMs across six text types and five decoding strategies. The dataset provides raw gaze coordinates, fixations, reading measures, and LLM internals (transition scores, attention scores, hidden states), alongside text- and word-level linguistic annotations and comprehension questions, all released with the full pre-processing code for reproducibility. The work also documents the complete pre-processing pipeline, including manual drift correction, and demonstrates the utility of EMTeC through descriptive and psycholinguistic analyses, which show robust reading-behavior patterns across models and decoding strategies. With stimuli, annotations, and analysis tooling openly available, EMTeC is a resource for studying how humans read AI-generated text, improving pre-processing methods, evaluating decoding strategies, and informing cognitive interpretability and surprisal-based predictions of reading times. Overall, EMTeC bridges cognitive science and NLP by enabling rigorous investigation of reading behavior on machine-generated text and supporting methodological advances in both fields.
Abstract
The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic eye-movements-while-reading corpus of 107 native English speakers reading machine-generated texts. The texts are generated by three large language models using five different decoding strategies, and they fall into six different text type categories. EMTeC contains the eye movement data at all stages of pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation sequences, and the reading measures. It further provides both the original and a corrected version of the fixation sequences, the latter accounting for vertical calibration drift. Moreover, the corpus includes the language models' internals that underlie the generation of the stimulus texts: the transition scores, the attention scores, and the hidden states. The stimuli are annotated for a range of linguistic features at both the text and the word level. We anticipate that EMTeC will be used for a variety of purposes, including but not restricted to the investigation of reading behavior on machine-generated text and the impact of different decoding strategies; reading behavior on different text types; the development of new pre-processing, data filtering, and drift correction algorithms; the cognitive interpretability and enhancement of language models; and the assessment of the predictive power of surprisal and entropy for human reading times. The data at all stages of pre-processing, the model internals, and the code to reproduce the stimulus generation, data pre-processing, and analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/.
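For readers unfamiliar with the two predictors named at the end of the abstract, the following is a minimal sketch of how surprisal and entropy are conventionally computed from a language model's next-token distribution. It is not taken from the EMTeC code base: the function names and the toy distribution `next_token_probs` are hypothetical and serve only as an illustration.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal (in bits) of a token with conditional probability `prob`."""
    return -math.log2(prob)


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical next-token distribution; a real model would produce one
# such distribution over its whole vocabulary at every generation step.
next_token_probs = {"cat": 0.5, "dog": 0.25, "bird": 0.25}

dog_surprisal = surprisal(next_token_probs["dog"])  # 2.0 bits
step_entropy = entropy(next_token_probs)            # 1.5 bits
```

Surprisal describes how unexpected the token that actually occurred was, while entropy describes how uncertain the model was before the token appeared, which is why the two are typically treated as separate predictors of reading times.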
