PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus

Deborah N. Jakobi; Thomas Kern; David R. Reich; Patrick Haller; Lena A. Jäger

TL;DR

PoTeC addresses the lack of naturalistic German eye-tracking data with an expertise-aware within-subject design. It implements a $2\times 2\times 2$ fully crossed design across discipline, level of study, and text domain, using twelve physics- and biology-based textbook texts read by 75 participants. The corpus combines rich manual and model-based linguistic annotations, multiple word- and text-level features, and both raw and corrected fixation data, all published with open data and integrated tooling in the pymovements ecosystem. The dataset enables analyses of expert versus non-expert reading, supports NLP and cognitive modeling applications, and emphasizes transparency, reproducibility, and reusability of eye-tracking data and preprocessing pipelines.

Abstract

The Potsdam Textbook Corpus (PoTeC) is a naturalistic eye-tracking-while-reading corpus containing data from 75 participants reading 12 scientific texts. PoTeC is the first naturalistic eye-tracking-while-reading corpus that contains eye movements from domain experts as well as novices in a within-participant manipulation: it is based on a $2\times 2\times 2$ fully crossed factorial design that includes the participants' level of study and discipline of study as between-subject factors and the text domain as a within-subject factor. The participants' reading comprehension was assessed by a series of text comprehension questions, and their domain knowledge was tested by text-independent background questions for each of the texts. The materials are annotated for a variety of linguistic features at different levels. We envision PoTeC being used for a wide range of studies, including but not limited to analyses of expert and non-expert reading strategies. The corpus, the accompanying data at all stages of the preprocessing pipeline, and all code used to preprocess the data are available on GitHub: https://github.com/DiLi-Lab/PoTeC.
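The factorial design described in the abstract can be made concrete with a small sketch that enumerates its eight cells and flags the ones that count as expert reading (graduate-level participants reading a text from their own discipline). This is an illustration only and not part of the PoTeC code base: the function names and the level label "undergraduate" are assumptions, and the discipline labels are inferred from the physics and biology text domains.

```python
from itertools import product

# Factor levels. "physics"/"biology" follow the text domains named in the
# corpus description; "undergraduate" is an assumed label for the
# non-graduate level of study.
DISCIPLINES = ("physics", "biology")        # between-subject factor
LEVELS = ("undergraduate", "graduate")      # between-subject factor
TEXT_DOMAINS = ("physics", "biology")       # within-subject factor


def is_expert_reading(discipline: str, level: str, text_domain: str) -> bool:
    """Expert reading: a graduate-level reader whose discipline matches the text domain."""
    return level == "graduate" and discipline == text_domain


def design_cells() -> list[dict]:
    """Enumerate every cell of the fully crossed 2 x 2 x 2 design."""
    return [
        {
            "discipline": discipline,
            "level": level,
            "text_domain": text_domain,
            "expert_reading": is_expert_reading(discipline, level, text_domain),
        }
        for discipline, level, text_domain in product(DISCIPLINES, LEVELS, TEXT_DOMAINS)
    ]


if __name__ == "__main__":
    for cell in design_cells():
        print(cell)
```

Because text domain is a within-subject factor, every participant contributes data to both text-domain cells of their own discipline and level of study.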

Paper Structure

This paper contains 34 sections, 1 equation, 4 figures, and 19 tables.

Introduction
New standard for data publication
Related Work
Naturalistic text passage corpora for German
Naturalistic text passage corpora for languages other than German
Single-sentence corpora with partially constructed stimuli for different languages
Naturalistic single-sentence corpora for different languages
Naturalistic self-paced reading corpora for different languages
Variations of naturalistic eye-tracking-while-reading corpora
Differences of PoTeC to existing corpora
Methods
Materials
Stimulus texts
Comprehension & background questions
Manual word level annotation
...and 19 more sections

Figures (4)

Figure 1: The $2\times 2\times 2$ fully crossed factorial study design of PoTeC. The red cubes denote expert reading, that is, participants at the graduate level of studies reading a text whose domain matches their discipline of studies.
Figure 2: Domain-specific and text-specific summary of the word length in characters, the log-lexical lemma frequency, and surprisal (as estimated by GPT-2 large).
Figure 3: Posterior effect estimates for predictors expert reading, reader discipline, the interaction of expert reading with reader discipline, expert technical term as well as log-lexical frequency, surprisal and word length and the interactions of log-lexical frequency, surprisal, and word length with expert reading.
Figure 4: Domain-specific and text-specific summary of word length (both character and syllable counts), log-lexical frequency (lemma and type) and surprisal (GPT-2 large estimated with left-hand sentence context) across texts.
