Generalizable Error Modeling for Human Data Annotation: Evidence From an Industry-Scale Search Data Annotation Program

Heinrich Peters; Alireza Hashemi; James Rae

TL;DR

Annotation errors in human-labeled data can degrade ML performance, and auditing efficiency is a bottleneck in industry-scale programs. This work introduces a generalizable error-detection approach that uses both task features and behavioral features from the annotation process, achieving moderate predictive performance (AUC=0.65-0.75) and cross-application generalization. It uses SHAP values to show how individual features contribute to error predictions, and it reports substantial efficiency gains for auditing and relabeling when tasks with high predicted error probabilities are prioritized (e.g., 40% for the music streaming application). The findings support a task-agnostic, behaviorally informed approach to data quality, enabling scalable improvements in industry-scale annotation pipelines.

Abstract

Machine learning (ML) and artificial intelligence (AI) systems rely heavily on human-annotated data for training and evaluation. A major challenge in this context is the occurrence of annotation errors, which can degrade model performance. This paper presents a predictive error model trained to detect potential errors in search relevance annotation tasks for three industry-scale ML applications (music streaming, video streaming, and mobile apps). Drawing on real-world data from an extensive search relevance annotation program, we demonstrate that errors can be predicted with moderate model performance (AUC=0.65-0.75) and that model performance generalizes well across applications (i.e., a global, task-agnostic model performs on par with task-specific models). In contrast to past research, which has often focused on predicting annotation labels from task-specific features, our model is trained to predict errors directly from a combination of task features and behavioral features derived from the annotation process, in order to achieve a high degree of generalizability. We demonstrate the usefulness of the model in the context of auditing, where prioritizing tasks with high predicted error probabilities considerably increases the number of corrected annotation errors (e.g., 40% efficiency gains for the music streaming application). These results highlight that behavioral error detection models can yield considerable improvements in the efficiency and quality of data annotation processes. Our findings reveal critical insights into effective error management in the data annotation process, thereby contributing to the broader field of human-in-the-loop ML.
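The auditing idea described in the abstract can be illustrated with a minimal sketch: given a fixed audit budget, audit the tasks with the highest predicted error probability first and compare the number of errors found with what a random audit would find on average. The function names (`errors_caught`, `expected_random_catch`) and the toy scores and labels are assumptions made for illustration; they are not the paper's data, model, or evaluation code.

```python
def errors_caught(scores, is_error, budget):
    """Count true errors among the `budget` tasks with the highest predicted error score."""
    # Rank task indices from highest to lowest predicted error probability.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_error[i] for i in ranked[:budget])


def expected_random_catch(is_error, budget):
    """Expected number of errors found when `budget` tasks are audited uniformly at random."""
    return budget * sum(is_error) / len(is_error)


# Hypothetical toy data: predicted error probabilities and ground-truth error flags.
scores = [0.9, 0.2, 0.7, 0.1, 0.4, 0.8, 0.3, 0.05]
is_error = [1, 0, 1, 0, 0, 0, 1, 0]
budget = 3

prioritized = errors_caught(scores, is_error, budget)
baseline = expected_random_catch(is_error, budget)
print(f"prioritized audit: {prioritized} errors, random audit (expected): {baseline:.3f}")
```

In this toy setting the relative gain is simply `prioritized / baseline`; how the paper defines its reported efficiency gains may differ from this ratio.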

Paper Structure (22 sections, 5 figures, 1 table)

This paper contains 22 sections, 5 figures, 1 table.

Introduction
Background and Related Work
Current Research
Method
Data Collection and Sampling
Search Relevance Annotation Tasks
Operationalizations and Data Preprocessing
Modeling
Results
Model Performance
Model Explainability
Generalizability
Model Application: Error Detection for Efficient Relabeling
Discussion
Interpretation of Results
...and 7 more sections

Figures (5)

Figure 1: Area under the receiver operating characteristic curve (AUC), accuracy, precision, and recall of error models trained for different ML applications and a task-agnostic model trained for all ML applications simultaneously. Precision and recall were computed as macro averages across both classes. For a table showing exact numbers, please refer to SI B.
Figure 2: Cumulative mean SHAP values, ranked by magnitude, for error models trained for different ML applications and a task-agnostic model trained for all ML applications simultaneously. For a table showing exact numbers, please refer to SI E.
Figure 3: AUC matrix of generalization performance across search modalities and product categories (left; the training set is denoted on the y-axis; test set is denoted on the x-axis). Correlations of SHAP feature importances across search modalities and product categories (right). Exact values can be found in SI C and D.
Figure 4: Proportion of labels changed relative to audit volume as a function of the number of audited tasks for music streaming, mobile applications, video streaming, and all three ML applications combined.
Figure 5: Proportion of errors caught relative to overall number of errors as a function of the number of audited tasks for music streaming, mobile applications, video streaming, and all three ML applications combined.
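The quantities plotted in Figures 4 and 5 can be sketched as two running ratios over tasks ranked by predicted error probability. This is an illustrative sketch only: it assumes that every audited task that was truly an error results in a changed label, and the function name `audit_curves` and the toy inputs are hypothetical rather than taken from the paper.

```python
def audit_curves(scores, is_error):
    """Return two lists indexed by number of audited tasks (1..N).

    changed_rate[n-1]: errors found among the first n audited tasks, divided by n
                       (labels changed relative to audit volume, as in Figure 4).
    caught_rate[n-1]:  errors found among the first n audited tasks, divided by
                       the total number of errors (as in Figure 5).
    """
    # Audit tasks in order of decreasing predicted error probability.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_errors = sum(is_error)
    caught = 0
    changed_rate, caught_rate = [], []
    for n, i in enumerate(ranked, start=1):
        caught += is_error[i]
        changed_rate.append(caught / n)
        caught_rate.append(caught / total_errors if total_errors else 0.0)
    return changed_rate, caught_rate


# Hypothetical toy data: four tasks, two of which are true errors.
changed_rate, caught_rate = audit_curves([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1])
print(changed_rate)
print(caught_rate)
```

A well-ranked audit shows a high `changed_rate` at small audit volumes and a `caught_rate` that rises faster than the diagonal a random audit would follow.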
