Eye-Tracking-while-Reading: A Living Survey of Datasets with Open Library Support

Deborah N. Jakobi, David R. Reich, Paul Prasse, Jana M. Hofmann, Lena S. Bolliger, Lena A. Jäger

TL;DR

This paper addresses the lack of interoperability among eye-tracking-while-reading datasets by presenting a living, filterable online overview and by integrating publicly available datasets into the pymovements library, thereby promoting FAIR data principles. It combines a comprehensive corpus review with a practical data-sharing infrastructure: an online table detailing 45+ dataset features and a standardized preprocessing and access pipeline via pymovements. The key contributions are (i) a thorough, language-spanning survey of semi-naturalistic reading datasets, (ii) the online, editable dataset catalog, and (iii) the pymovements integration that enables consistent preprocessing and easy reuse. The work enables reproducible, cross-dataset analyses and lowers barriers to data reuse, fostering broader collaboration and methodological standardization in eye-tracking-while-reading research.

Abstract

Eye-tracking-while-reading corpora are a valuable resource for many different disciplines and use cases. Use cases range from studying the cognitive processes underlying reading to machine-learning-based applications, such as gaze-based assessments of reading comprehension. The past decades have seen an increase in the number and size of eye-tracking-while-reading datasets as well as increasing diversity with regard to the stimulus languages covered, the linguistic background of the participants, and the accompanying psychometric or demographic data. Because the data are spread across different disciplines and the communities lack shared data-sharing standards, many existing datasets cannot be easily reused due to a lack of interoperability. In this work, we aim to create more transparency and clarity with regard to existing datasets and their features across different disciplines by i) presenting an extensive overview of existing datasets, ii) simplifying the sharing of newly created datasets by publishing a living overview online, https://dili-lab.github.io/datasets.html, presenting over 45 features for each dataset, and iii) integrating all publicly available datasets into the Python package pymovements, which offers an eye-tracking datasets library. By doing so, we aim to strengthen the FAIR principles in eye-tracking-while-reading research and promote good scientific practices, such as reproducing and replicating studies.

Paper Structure (37 sections, 2 figures)

Figures (2)

  • Figure 1: The eye-tracking-while-reading data preprocessing stages. Stage 1: Raw data consisting of x- and y-screen coordinates or yaw and pitch in degrees of visual angle, and a timestamp. Stage 2: Gaze event data (fixations and saccades). Stage 3: Area-of-interest (AoI) based measures. Each stage requires certain metadata and produces new metadata. The data from one stage are converted to the next stage by parametrized algorithms. If the algorithms and their parameters are unknown, the pipeline cannot be fully reproduced.
  • Figure 2: Cumulative overview of eye-tracking-while-reading datasets over time. (a) Cumulative number of published datasets by data modality, indicating which datasets are supported by the open-source library pymovements (Krakowczyk et al., 2023). (b) Cumulative number of stimulus languages and language families represented in published datasets. In both sub-figures, the x-axis denotes the year of publication.
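
The three stages described in the Figure 1 caption can be illustrated with a minimal, self-contained Python sketch. It is not the pymovements API and not the authors' pipeline: the function names (detect_fixations, total_fixation_time), the simple velocity-threshold rule, the threshold values, and the toy samples and word boxes are all assumptions chosen for illustration. The sketch only shows why each stage transition depends on explicit parameters, which is the reproducibility point the caption makes.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Stage 1 (raw data): one sample = (timestamp in ms, x in px, y in px).
Sample = Tuple[float, float, float]
# An area of interest (AoI) as a box: (x_min, y_min, x_max, y_max) in px.
Box = Tuple[float, float, float, float]


@dataclass
class Fixation:
    """Stage 2 (gaze event data): one fixation event."""
    onset: float
    offset: float
    x: float
    y: float

    @property
    def duration(self) -> float:
        return self.offset - self.onset


def detect_fixations(
    samples: List[Sample],
    velocity_threshold: float = 0.5,  # px/ms, illustrative value only
    min_duration: float = 50.0,       # ms, illustrative value only
) -> List[Fixation]:
    """Stage 1 -> Stage 2 using a simple velocity-threshold rule."""
    fixations: List[Fixation] = []
    current: List[Sample] = []

    def flush() -> None:
        # Keep the current run of slow samples only if it lasted long enough.
        if len(current) >= 2 and current[-1][0] - current[0][0] >= min_duration:
            fixations.append(Fixation(
                onset=current[0][0],
                offset=current[-1][0],
                x=mean(s[1] for s in current),
                y=mean(s[2] for s in current),
            ))
        current.clear()

    for prev, cur in zip(samples, samples[1:]):
        dt = cur[0] - prev[0]
        dist = math.hypot(cur[1] - prev[1], cur[2] - prev[2])
        velocity = dist / dt if dt > 0 else float("inf")
        if velocity <= velocity_threshold:
            if not current:
                current.append(prev)
            current.append(cur)
        else:
            flush()  # a fast sample-to-sample movement ends the run
    flush()
    return fixations


def total_fixation_time(
    fixations: List[Fixation], aois: Dict[str, Box]
) -> Dict[str, float]:
    """Stage 2 -> Stage 3: sum fixation durations per AoI."""
    totals = {name: 0.0 for name in aois}
    for fix in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= fix.x < x1 and y0 <= fix.y < y1:
                totals[name] += fix.duration
                break
    return totals


# Toy data at 100 Hz: gaze rests on one word, then jumps to the next.
samples = [(t, 100.0, 200.0) for t in range(0, 100, 10)]
samples += [(t, 250.0, 200.0) for t in range(100, 210, 10)]
aois = {"The": (50.0, 180.0, 150.0, 220.0), "cat": (200.0, 180.0, 300.0, 220.0)}

fixations = detect_fixations(samples)
print(total_fixation_time(fixations, aois))  # {'The': 90.0, 'cat': 100.0}
```

Changing velocity_threshold or min_duration changes which fixations are produced and therefore every AoI-based measure derived from them, so a dataset that ships only Stage 3 measures without these parameters cannot be regenerated from its raw data.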