"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations

Michael Hardy

TL;DR

This paper demonstrates methods for answering questions of accuracy, bias, fairness, and usefulness when labels from expert humans have very low reliability, and estimates how human label quality would change if models were used in a human-in-the-loop context.

Abstract

"Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently human-only task. We ask whether such a task can be automated using two Large Language Model (LLM) architecture families, encoders and GPT decoders, and we answer with novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieves state-of-the-art, even "super-human", results across all classroom annotation tasks. However, not all of these positive results survive more rigorous evaluation measures, which reveal spurious correlations and nonrandom racial biases across models and humans. This study then extends these methods to estimate how human label quality would change if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

Paper Structure

This paper contains 27 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Data Processes and Sources for Studying Teaching and Annotation Quality
  • Figure 2: Spearman correlation coefficients and confidence intervals by MQI item for all rater families and studies. Human kane_national_2015, Encoder (current study, Section \ref{sec:modelfams}), and GPT wang_is_2023 family correlations are computed between each rater and one randomly sampled human rater for each observation, following the process used in the original human study, repeated 1,000 times for bootstrapped confidence intervals. The xu_promises_2024 coefficients are reported from Tables 5 and 9 of that paper, where each number represents the best of several ensemble models fit for each individual item. Bold in the table indicates the highest-performing label family.
  • Figure 3: Correlations (fainter color hues, numerator of Eq. \ref{eq:disatten}), disattenuated correlations (darker color hues, Eq. \ref{eq:disatten}), and their respective 95% confidence intervals between human raters and model raters by MQI item. Item-level rater-label generalizability, $\mathbf{E}\rho^2$, is shown for both human and model raters, along with the attenuated and disattenuated correlations between humans and models, $\varrho_{hm}$. The attenuated correlation confidence intervals were calculated with the standard Fisher transformation and $\alpha = 0.05$. Disattenuated correlation confidence intervals used the empirical method recommended in charles_correction_2005.
  • Figure 4: Section \ref{sec:evalmethods} study method results for four focus MQI items across Human kane_national_2015, Encoder (this study), and GPT wang_is_2023 rater families.
      (a) Distributions. Score distributions by rater type.
      (b) Reliabilities. Inter-rater reliability metrics introduced in Section \ref{sec:reliabilities}. C's $\kappa$: Cohen's $\kappa$; QWK: Quadratic Weighted Kappa; %Agr: percent exact agreement; %Agr±1: percent agreement within 1 category; ICC: intraclass correlation; AICC: adjusted intraclass correlation; ${r}$: Pearson's correlation; $\mathbf{\rho}$: Spearman's rank correlation. Bold marks the highest value for a given metric.
      (c) Generalizability Measures and Spurious Correlation Detection. Section \ref{sec:gtheorystudy}: generalizability coefficient $\mathbf{E}\rho^2$ and dependability measure $\mathit{\Phi}$. Section \ref{sec:spurious}: $\varrho_{\mathbb{hm}}$ is the disattenuated correlation. Red font indicates that the correlation was spurious or incalculable due to low reliabilities.
      (d) Disentangled Rater Bias. Section \ref{sec:biases}: standardized rater bias $\phi_{jr}$ (x axis) and rater variability/consistency $\psi_{jr}$ (y axis) from Equation \ref{eq:MHRM_SDM}, $\eta_j$-centered. Each point represents an individual human or model rater. More severe raters are to the left, more lenient raters to the right.
      (e) Fairness across Racial Lines. Section \ref{sec:fairness}: standardized difference in rater bias $\phi_r$ (x axis) and combined rater variability/consistency $\psi_r$ (y axis) across Black teachers and White teachers. Leftward values are more severe towards Black teachers; rightward values are more lenient. A horizontal bar on a marker represents the 95% CI for bias.
      (f) Estimated Improvements to Reliability. Section \ref{sec:dstudy}: expected changes to rating reliability are estimated improvements to the quality (via reliability) of classroom ratings for various contexts. For the single individual human baseline (black), which estimates reliability improvements from visiting the same class, the x axis represents the number of different 15-minute classroom observations of the same teacher. The red line is the estimate for having a different human observer conduct the observations as described. By contrast, for the model raters, single Encoder (green), Encoder ensemble (average of 3 encoders) (red), and GPT ensemble (average of 3 GPT prompt-engineered models), the x axis is the number of full classroom observations conducted where the human (black) observes at least 15 minutes (in-the-loop) of the same classroom (models observe the entire class period).
    A summary of these results can be found in Table \ref{tab:focussummary}.
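
The captions for Figures 2 and 3 describe three computations: a Spearman correlation against one randomly sampled human rater per observation (repeated 1,000 times), a Fisher-transformation confidence interval, and a disattenuated correlation. The following is a minimal sketch of those ideas in plain Python, not the paper's implementation. It assumes that the disattenuation equation takes the standard Spearman form (the observed correlation divided by the square root of the product of the two reliabilities), that the generalizability coefficients can stand in for those reliabilities, and that only the choice of human rater is resampled in each repetition. All function names and the toy data are hypothetical.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fisher_ci(r, n, z_crit=1.959964):
    """Confidence interval for a correlation via the Fisher z-transformation.

    z_crit defaults to the two-sided critical value for alpha = 0.05.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def disattenuate(r_hm, rel_h, rel_m):
    """Assumed form: observed correlation / sqrt(rel_h * rel_m).

    Returns None when a reliability is not positive (incalculable).
    """
    if rel_h <= 0 or rel_m <= 0:
        return None
    return r_hm / math.sqrt(rel_h * rel_m)


def random_human_correlations(model_scores, human_scores, n_rep=1000, seed=0):
    """Correlate model scores with one randomly chosen human per observation.

    human_scores[i] holds every human rating of observation i.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_rep):
        sampled = [rng.choice(raters) for raters in human_scores]
        results.append(spearman(model_scores, sampled))
    return results


def percentile_interval(samples, alpha=0.05):
    """Percentile interval over the non-NaN replicate values."""
    s = sorted(v for v in samples if not math.isnan(v))
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


# Toy data (hypothetical): 8 observations, each scored by two human raters.
demo_model = [1, 2, 2, 3, 1, 3, 2, 1]
demo_humans = [[1, 2], [2, 2], [1, 3], [3, 3], [1, 1], [2, 3], [2, 1], [1, 2]]

if __name__ == "__main__":
    reps = random_human_correlations(demo_model, demo_humans)
    print("percentile interval:", percentile_interval(reps))
    print("disattenuated:", disattenuate(0.30, 0.40, 0.60))
```

With low reliabilities, the disattenuated value can exceed 1 in magnitude or be undefined, which is why `disattenuate` returns `None` when a reliability is not positive instead of raising an error. The sketch does not reproduce the empirical interval method from charles_correction_2005.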