Rater Equivalence: Evaluating Classifiers in Human Judgment Settings
Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger
TL;DR
The paper addresses how to evaluate classifiers when ground truth is unavailable and human labels are the only basis for evaluation. It introduces rater equivalence and power curves to quantify how many independent human raters a benchmark panel needs in order to match a classifier’s performance, under both objective and subjective utility models. A calibrated Anonymous Bayesian Combiner (ABC) is proposed to optimally synthesize benchmark labels, minimizing rater equivalence under cross-entropy scoring. Through theory and case studies, the work shows when larger human panels help or hinder, and it provides practical guidance for deploying AI in human-centered tasks. The framework reframes benchmarking in domains with nuanced human judgments, offering interpretable metrics and robust methods for panel-based evaluation.
Abstract
In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
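The definition of rater equivalence can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes binary labels, a simple majority-vote combiner in place of the ABC, and plain agreement with a held-out reference rater as the score (the second, judgment-matching utility model). The function names `majority`, `agreement`, `power_curve`, and `rater_equivalence` are hypothetical, and the integer-valued result ignores any interpolation between panel sizes.

```python
from itertools import combinations
from statistics import mean


def majority(votes):
    # Binary majority vote; ties go to label 1 (an arbitrary assumption).
    return int(2 * sum(votes) >= len(votes))


def agreement(preds, targets):
    # Fraction of items on which two label sequences coincide.
    return mean(int(p == t) for p, t in zip(preds, targets))


def power_curve(labels, max_k):
    """labels[i][j] is the binary label rater j gave item i.

    Entry k-1 of the result is the mean agreement between a size-k panel's
    majority vote and a held-out reference rater, averaged over every
    reference rater and every panel drawn from the remaining raters.
    """
    n_raters = len(labels[0])
    curve = []
    for k in range(1, max_k + 1):
        scores = []
        for ref in range(n_raters):
            others = [j for j in range(n_raters) if j != ref]
            targets = [row[ref] for row in labels]
            for panel in combinations(others, k):
                preds = [majority([row[j] for j in panel]) for row in labels]
                scores.append(agreement(preds, targets))
        curve.append(mean(scores))
    return curve


def rater_equivalence(curve, classifier_score):
    # Smallest panel size whose score reaches the classifier's score,
    # or None if no panel on the curve does.
    for k, score in enumerate(curve, start=1):
        if score >= classifier_score:
            return k
    return None
```

In this sketch the classifier would be scored against the same held-out reference raters, so that its score and the power curve are directly comparable. A return value of `None` means the classifier outperforms every panel size considered.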
