Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Jodi M. Casabianca, Maggie Beiting-Parrish

TL;DR

Item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Adjusting for rater severity yields corrected estimates of summary quality and provides diagnostic insight into rater performance.

Abstract

Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
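
For orientation, one common way to write the multi-faceted Rasch model (MFRM) is the adjacent-category logit form below. This is a sketch of the standard rating-scale formulation, not a reproduction of the paper's own equation; the symbols $\delta_i$, $\lambda_j$, and $\tau_k$ are assumed notation, and only $\theta$ (quality, in logits) is taken from the figure captions.

$$
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_k
$$

Here $P_{nijk}$ is the probability that rater $j$ assigns category $k$ to output $n$ on item (dimension) $i$, $\theta_n$ is the quality of output $n$, $\delta_i$ is the difficulty of item $i$, $\lambda_j$ is the severity of rater $j$, and $\tau_k$ is the threshold between categories $k-1$ and $k$. Because $\lambda_j$ enters as a separate additive term, the estimate of $\theta_n$ is adjusted for which raters happened to score the output. Letting the thresholds vary by rater ($\tau_{jk}$) is one way to capture differences in how raters use the scale, such as centrality.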

Paper Structure

This paper contains 27 sections, 1 equation, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Within-rater score distributions by dimension. Bars show the percentage of ratings assigned to each score (1–7) for Factual accuracy, Coherence, Content coverage, and Overall quality, faceted by rater.
  • Figure 2: Array of scatterplots showing MFRM severity (logits) versus centrality (SD of thresholds), by rater. Each panel shows a single rater; each point represents a policy scored by that rater. Solid lines indicate linear trends with 95% confidence bands.
  • Figure 3: Average rater profiles across policies. Each point represents a rater’s mean MFRM severity (logits) and mean centrality (standard deviation of category thresholds) averaged over all policies scored. Point size reflects the number of policies rated; the solid line shows the linear association with 95% confidence band (Pearson r shown).
  • Figure 4: Policy-level summary quality based on raw scores and IRT model estimates. Average quality estimates are plotted by model size and color coded by training method. Panel (a) shows mean human ratings across four items on the 7-point scale, panel (b) shows the PCM estimates, and panel (c) shows the corresponding MFRM quality estimates ($\theta$, logits). Points represent model variants, grouped by model size. Dashed horizontal reference lines mark human-written summaries and the Lead-3 baseline. Note: Although the values of the estimates are not directly comparable across panels, the differences in rank ordering are meaningful.
  • Figure 5: Rater severity and centrality by policy. Each panel shows one policy. Points represent individual raters. The line shows the overall trend for each policy.
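
The severity and centrality quantities plotted in Figures 2, 3, and 5 can be illustrated with a minimal Python sketch. It assumes a rating-scale style MFRM with rater-specific thresholds and computes centrality as the population standard deviation of those thresholds; the function names, parameter values, and the choice of population rather than sample SD are all hypothetical and are not taken from the paper's code.

```python
import math
from statistics import pstdev


def mfrm_category_probs(theta, delta, severity, thresholds):
    """Category probabilities for one output, item, and rater.

    theta: output quality (logits); delta: item difficulty;
    severity: rater severity; thresholds: the rater's K step thresholds.
    Returns K + 1 probabilities for score categories 1..K+1.
    All names here are illustrative assumptions.
    """
    eta = theta - delta - severity
    # The log-numerator of category c is the running sum of (eta - tau_k).
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + eta - tau)
    peak = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Expected rating when categories are scored 1, 2, ..., len(probs)."""
    return sum(score * p for score, p in enumerate(probs, start=1))


def centrality(thresholds):
    """Spread of a rater's thresholds (population SD, an assumed choice)."""
    return pstdev(thresholds)


# Hypothetical thresholds for a 7-point scale (six steps).
thresholds = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]

# The same output (theta = 1.0) scored by a lenient and a severe rater.
lenient = expected_score(mfrm_category_probs(1.0, 0.0, -0.5, thresholds))
severe = expected_score(mfrm_category_probs(1.0, 0.0, 0.5, thresholds))
print(f"lenient rater: {lenient:.2f}, severe rater: {severe:.2f}")
print(f"centrality (SD of thresholds): {centrality(thresholds):.2f}")
```

In this sketch the severe rater's expected score is lower for the same underlying quality, which is the distortion a severity adjustment is meant to remove. Under the same assumptions, more widely spaced thresholds (a larger SD) concentrate probability in the middle categories.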