Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger

TL;DR

The paper addresses how to evaluate classifiers when ground truth is unavailable and human labels are the only basis for evaluation. It introduces rater equivalence and power curves to quantify how many independent human raters a benchmark panel needs in order to match a classifier’s performance, under both objective and subjective utility models. A calibrated Anonymous Bayesian Combiner (ABC) is proposed to optimally synthesize benchmark labels, minimizing rater equivalence under cross-entropy scoring. Through theory and case studies, the work shows when larger human panels help and when they hurt, and it gives practical guidance for deploying AI in human-centered tasks. The framework reframes benchmarking in domains with nuanced human judgments, offering interpretable metrics and robust methods for panel-based evaluation.

Abstract

In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
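To make the definition of rater equivalence concrete, the sketch below reads it off a power curve, that is, a list of expected scores for benchmark panels of 1, 2, 3, ... raters. It assumes that higher scores are better and that fractional values come from linear interpolation between adjacent panel sizes; the function name `rater_equivalence` and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def rater_equivalence(power_curve: Sequence[float],
                      classifier_score: float) -> Optional[float]:
    """Smallest (possibly fractional) panel size whose score matches the classifier.

    power_curve[k - 1] is the expected score of a benchmark panel of k raters.
    Assumption: linear interpolation between adjacent panel sizes.
    Returns None when the classifier scores below a single rater or is never
    reached by the panel sizes covered by the curve.
    """
    if not power_curve:
        raise ValueError("power_curve must contain at least one panel size")
    if classifier_score < power_curve[0]:
        return None  # worse than one rater; this sketch does not extrapolate
    for i in range(1, len(power_curve)):
        lo, hi = power_curve[i - 1], power_curve[i]  # panels of size i and i + 1
        if lo <= classifier_score <= hi:
            if hi == lo:
                return float(i)
            return i + (classifier_score - lo) / (hi - lo)
    return None  # classifier is not matched by any panel size on the curve


# Hypothetical power curve for panels of 1, 2, and 3 raters.
curve = [0.10, 0.14, 0.16]
print(rater_equivalence(curve, 0.13))  # roughly 1.75: between one and two raters
```

Returning `None` instead of extrapolating keeps the sketch honest about what the curve can support: a classifier that beats the largest panel measured is only known to be worth more raters than were tested.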

Paper Structure

This paper contains 60 sections, 1 theorem, 60 equations, 8 figures, 10 tables, 2 algorithms.

Key Result

Lemma 15

Let $\{f_n\}$ be a sequence of random continuous bijections on $[0,K]$ with continuous inverses $f_n^{-1}$. Assume: Then:

Figures (8)

  • Figure 1: A framework for classifier evaluation in human judgment settings
  • Figure 2: An example power curve depicting a classifier's score of 0.13. The classifier's rater equivalence is 1.96. A single benchmark rater yields a lower expected score, whereas a benchmark panel comprising two raters shows a slightly higher expected score.
  • Figure 3: Top: objective utility model. Bottom: subjective utility model. Matching the ground truth is important on the top; matching the distribution of rater labels is important on the bottom.
  • Figure 4: Survey Equivalence between human labels and Jigsaw's Wikipedia comment personal attack classifier under different combiner and scoring function pairings. The survey equivalence score is shown on the x-axis. Error bars cover 95% of 500 bootstrap item samples.
  • Figure 5: Survey Equivalence between human labels of news credibility and CredBank's heuristic classifier for the ABC combiner and cross-entropy scorer. The survey equivalence score is shown on the x-axis. Error bars cover 95% of 500 bootstrap samples.
  • ...and 3 more figures
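
The captions for Figures 4 and 5 describe error bars that cover 95% of 500 bootstrap samples. A minimal sketch of a percentile bootstrap over items is given below, assuming each item contributes one numeric score and the statistic of interest is the mean. The paper's procedure presumably recomputes the full power curve on each resample; this simplification, the function name `bootstrap_interval`, and the example scores are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_interval(item_scores: Sequence[float],
                       n_boot: int = 500,
                       coverage: float = 0.95,
                       seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-item scores."""
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    rng = random.Random(seed)
    n = len(item_scores)
    # Resample items with replacement and recompute the statistic each time.
    means = sorted(mean(rng.choices(item_scores, k=n)) for _ in range(n_boot))
    # Drop an equal share of resampled means from each tail.
    cut = int((1 - coverage) / 2 * (n_boot - 1))
    return means[cut], means[n_boot - 1 - cut]


# Hypothetical per-item scores for one panel size.
scores = [0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.10, 0.13]
low, high = bootstrap_interval(scores)
print(low, high)
```

Resampling whole items, rather than individual labels, keeps all ratings of the same item together in each bootstrap sample.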

Theorems & Definitions (41)

  • Claim 0: Objective Utility Model: Individual & Panel Labels Fail
  • Claim 0: Subjective Utility Model: Individual Label Works
  • Claim 0: Subjective Utility Model: Panel Labels Fail
  • Definition 1: Reliable Ordering
  • Claim 1: Bigger is not Always Better
  • Claim 1: More Agreement is Not Always Better
  • Claim 1: Subjective Utility Model: Individual Labels Work for Ordering Classifiers
  • Claim 1: Subjective Utility Model: Larger Panels Fail for Ordering Classifiers
  • Definition 2: Power Score
  • Definition 3: Power Curve
  • ...and 31 more