Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes
Karin de Langis, William Walker, Khanh Chi Le, Dongyeop Kang
TL;DR
This work introduces PreferRead, a dataset that augments traditional preference annotations with fine-grained reading-process signals captured via mouse tracking. From 1,000 preference items, each annotated by three crowdworkers, the authors derive metrics such as re-reading frequency, reading-path length, and word-level durations to study how reading behavior relates to annotation outcomes and agreement. Key findings show that re-reading correlates with higher inter-annotator agreement, while longer reading paths and loops are associated with disagreement, suggesting that reading-process data reveal the cognitive dynamics behind subjective judgments. The study demonstrates the feasibility of scalable reading-augmented annotations and highlights their potential to improve reliability modeling and annotation design; the code and data are released for reuse.
Abstract
We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., which parts of the text they focus on, re-read, or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset, PreferRead, that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose and rarely revisiting the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas longer reading paths and reading times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making, and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.
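To make the reading-process measures concrete, the following is a minimal sketch of how re-reading counts, reading-path length, and per-region dwell time might be derived from a stream of mouse-tracking samples. The input format, the region labels ("prompt", "A", "B"), and the function name `reading_metrics` are assumptions made for illustration; they are not taken from the PreferRead release, and the paper's own metric definitions may differ.

```python
from collections import Counter


def reading_metrics(events):
    """Summarize one trial's reading path.

    events: time-ordered list of (timestamp_seconds, region) samples, where
    region is a hypothetical label such as "prompt", "A", or "B".
    """
    # Collapse consecutive samples in the same region into a single visit.
    visits = []  # each entry: [region, first_timestamp, last_timestamp]
    for t, region in events:
        if visits and visits[-1][0] == region:
            visits[-1][2] = t
        else:
            visits.append([region, t, t])

    path = [v[0] for v in visits]

    # Assumption: a re-read is any visit to a region after its first visit.
    rereads = {region: n - 1 for region, n in Counter(path).items()}

    # Dwell time counts only time between samples inside the same visit,
    # so gaps while moving between regions are ignored.
    dwell = Counter()
    for region, start, end in visits:
        dwell[region] += end - start

    return {
        "path": path,
        "path_length": len(path),
        "rereads": rereads,
        "dwell": dict(dwell),
    }


# Example trial: read the prompt, then A, then B, then return to A.
samples = [
    (0.0, "prompt"), (1.0, "prompt"),
    (2.0, "A"), (4.0, "A"),
    (5.0, "B"), (7.0, "B"),
    (8.0, "A"), (9.0, "A"),
]
metrics = reading_metrics(samples)
print(metrics["path"], metrics["rereads"], metrics["dwell"])
```

In this toy trial the annotator returns to response A once, which is the kind of revisit the abstract reports for the option an annotator ultimately chooses. Per-trial summaries of this kind could then be compared against agreement across the annotators of each item.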
