On Efficient and Statistical Quality Estimation for Data Annotation

Jan-Christoph Klie, Juan Haladjian, Marc Kirchner, Rahul Nair

TL;DR

The paper addresses the challenge of efficiently estimating annotation quality without inspecting the entire dataset. It develops two statistically grounded approaches: exact confidence-interval-based sample-size calculations for error-rate estimation and acceptance sampling to decide batch quality, potentially reducing observations by up to 50% while preserving guarantees. The authors provide theoretical analyses, practical plans (single, double, and sequential sampling), and a Python package, and they validate the methods on real NLP datasets. The findings indicate that acceptance sampling, particularly sequential sampling with curtailment, can offer substantial inspection savings while maintaining statistical rigor, making it a viable alternative to conventional error-rate estimation in data annotation workflows.

Abstract

Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect. But checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sizes are chosen mostly without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals for finding the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
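To make the acceptance-sampling idea concrete, the following is a minimal sketch of how a single sampling plan could be derived. It is not the authors' package or their exact procedure. It assumes a binomial model (sampling with replacement, or a batch much larger than the sample), and the names `p_a` (acceptable error rate), `p_r` (rejectable error rate), `alpha` (risk of rejecting a good batch), and `beta` (risk of accepting a bad batch) are illustrative choices, as are the function names.

```python
from math import comb


def accept_probability(n, c, p):
    """Probability of seeing at most c errors among n inspected items.

    Assumes a binomial model with per-item error probability p.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def find_single_sampling_plan(p_a, p_r, alpha=0.05, beta=0.05, max_n=2000):
    """Search for the smallest sample size n and acceptance number c such that

    - a batch with error rate p_a is accepted with probability >= 1 - alpha
    - a batch with error rate p_r is accepted with probability <= beta

    Returns (n, c), or None if no plan exists with n <= max_n.
    All parameter names are hypothetical, chosen for this sketch.
    """
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            if accept_probability(n, c, p_a) >= 1 - alpha:
                # This is the smallest c that protects good batches for this n.
                # A larger c would only raise the acceptance probability at p_r,
                # so if this c fails the second constraint, no c works for n.
                if accept_probability(n, c, p_r) <= beta:
                    return n, c
                break
    return None


if __name__ == "__main__":
    plan = find_single_sampling_plan(p_a=0.02, p_r=0.10)
    print("Inspect n items, accept if errors <= c:", plan)
```

With such a plan, an inspector checks `n` annotated instances and accepts the batch if at most `c` of them are wrong. For small batches inspected without replacement, a hypergeometric model would be the more faithful choice than the binomial approximation used here.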

Paper Structure (24 sections, 8 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Overview of agile data corpus creation, the recommended workflow to annotate high-quality datasets. This work explores how to efficiently estimate annotation quality using statistics.
  • Figure 2: Flowcharts for the three different acceptance sampling methods discussed in this work.
  • Figure 3: Sampling error vs. margin of error when sampling without replacement for manually inspecting a dataset to estimate the error rate. We compute a hypergeometric confidence interval for different confidence levels $\alpha$ and two underlying, true error rates $p_e$ and $N=1000$. The closer the true error rate (and thereby hopefully the assumed error rate to compute the sample size) is to $0.5$, the larger the required sample size is. The jaggedness is caused by the distribution's discreteness.
  • Figure 4: Average sample numbers (ASN) required for a strict and relaxed configuration for Confidence Intervals (CI), Single Sampling Plans (SSP), Double Sampling Plans (DSP), and Sequential Sampling Plans based on the Sequential Probability Ratio Test (SPRT). Dotted lines are plans with curtailment. The confidence interval requiring the smaller sample size is the one assuming $p_a$.
  • Figure 5: Simulating acceptance sampling on existing NLP datasets. We run 1000 simulations with different seeds and count how often a sample was accepted or rejected.
  • ...and 5 more figures