Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes

Karin de Langis, William Walker, Khanh Chi Le, Dongyeop Kang

TL;DR

This work introduces PreferRead, a dataset that augments traditional preference annotations with fine-grained reading-process signals captured via mouse tracking. By collecting 1,000 preference items annotated by three crowdworkers each, the authors derive metrics such as re-reading frequency, reading-path length, and word-level durations to study how reading behavior relates to annotation outcomes and agreement. Key findings show that re-reading correlates with higher inter-annotator agreement, while longer reading paths and loops are associated with disagreement, suggesting that reading-process data reveal cognitive dynamics behind subjective judgments. The study demonstrates the feasibility of scalable reading-augmented annotation and highlights its potential to improve reliability modeling and annotation design. Code and data are released for reuse.

Abstract

We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., what parts of the text they focus on, re-read or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset PreferRead that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.
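
To make the reading-process measures concrete, the following is a minimal sketch of how re-reading and reading-path length could be derived from a mouse-tracking trace. It assumes the trace has already been reduced to a sequence of region labels; the labels `prompt`, `response_a`, and `response_b`, the function names, and the metric definitions are illustrative assumptions, not the PreferRead schema or the authors' implementation.

```python
from itertools import groupby


def collapse_visits(regions):
    """Collapse consecutive samples in the same region into a single visit."""
    return [region for region, _ in groupby(regions)]


def reading_metrics(regions):
    """Compute simple reading-process metrics from a sequence of region labels.

    Region labels are hypothetical: "prompt", "response_a", "response_b".
    """
    visits = collapse_visits(regions)
    seen = set()
    revisits = {}  # region -> number of times it was returned to
    for region in visits:
        if region in seen:
            revisits[region] = revisits.get(region, 0) + 1
        seen.add(region)
    return {
        # Number of distinct visits, used here as a stand-in for path length.
        "path_length": len(visits),
        # True if either response was returned to after leaving it.
        "reread_response": any(r in revisits for r in ("response_a", "response_b")),
        # True if the annotator went back to the prompt.
        "revisited_prompt": "prompt" in revisits,
        "revisits": revisits,
    }


# Hypothetical trace: one label per mouse sample, in time order.
trace = ["prompt", "prompt", "response_a", "response_a", "response_b", "response_a"]
metrics = reading_metrics(trace)
print(metrics)
```

In this toy trace the annotator returns to the first response once and never goes back to the prompt, the pattern the abstract reports as common. A real pipeline would also need to map cursor coordinates to regions and filter out incidental mouse movement.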

Paper Structure

This paper contains 24 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Our annotation interface records reading behavior of annotators as they evaluate and select preferred texts, providing fine-grained insights into the decision-making process during annotation.
  • Figure 2: An example item from PreferRead with the collected label and reading behaviors from one annotator. Darker highlights indicate relatively long gaze durations, and lighter highlights indicate relatively short gaze durations. Word-level gaze estimates can hint at aspects of human decision-making; for instance, attention is concentrated on the odd turn of phrase "the most interesting thing about mayonnaise is it's used as a weapon" in the chosen response.
  • Figure 3: Our annotation interface shows a prompt followed by two possible responses. Participants mouse over blurred text to reveal it and read. All mouse movements are recorded.
  • Figure 4: Annotator word duration estimates from an item in PreferRead. Darker highlights indicate higher attention, i.e., longer gaze durations, while lighter highlights correspond to short or below-average durations. Notice that in the prompt (top box), the annotator spends more time on the human's questions than on the assistant's verbose reply.
  • Figure 5: A timeline of the mouse location of three annotators reading one item in our dataset.
  • ...and 3 more figures