Skewed Score: A statistical framework to assess autograders

Magda Dubois, Harry Coppock, Mario Giulianelli, Timo Flesch, Lennart Luettgau, Cozmin Ududec

TL;DR

We propose a statistical framework based on Bayesian generalised linear models that lets researchers assess their autograders while simultaneously addressing their primary research questions (e.g., LLM evaluation), enabling both performance analysis and bias detection.

Abstract

The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
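As a rough illustration of modelling scores as a function of grader type, the sketch below fits the simplest possible case: a Gaussian linear model with a conjugate Gaussian prior, solved in closed form. This is a simplification chosen so the example runs with the standard library only; it is not the authors' implementation, and the likelihood used in the paper may differ (for example, for ordinal scores). The function name `posterior_grader_effect`, the simulated means (7 for the human, 6 for the autograder), the noise level, and the prior width are all assumptions made for illustration.

```python
import math
import random


def posterior_grader_effect(scores, is_autograder, noise_sd=1.0, prior_sd=10.0):
    """Posterior mean and sd of beta_1 in: score = beta_0 + beta_1 * is_autograder + noise.

    Assumes a known noise sd and independent N(0, prior_sd^2) priors on both
    coefficients (illustrative choices, not taken from the paper).
    """
    n = len(scores)
    n_auto = sum(is_autograder)
    sum_y = sum(scores)
    sum_xy = sum(y for y, x in zip(scores, is_autograder) if x)

    s2, t2 = noise_sd ** 2, prior_sd ** 2
    # Posterior precision matrix A = X^T X / sigma^2 + I / tau^2 (2x2, symmetric).
    a00 = n / s2 + 1.0 / t2
    a01 = n_auto / s2
    a11 = n_auto / s2 + 1.0 / t2
    det = a00 * a11 - a01 * a01
    # Right-hand side b = X^T y / sigma^2.
    b0 = sum_y / s2
    b1 = sum_xy / s2
    # Posterior mean is A^{-1} b; posterior covariance is A^{-1}.
    mean_b1 = (-a01 * b0 + a00 * b1) / det
    sd_b1 = math.sqrt(a00 / det)
    return mean_b1, sd_b1


# Simulated data (hypothetical values): the autograder scores one point lower on average.
rng = random.Random(0)
human_scores = [rng.gauss(7.0, 1.0) for _ in range(200)]
auto_scores = [rng.gauss(6.0, 1.0) for _ in range(200)]
scores = human_scores + auto_scores
is_autograder = [0] * 200 + [1] * 200

mean, sd = posterior_grader_effect(scores, is_autograder)
# The posterior is Gaussian, so a 95% credible interval is mean +/- 1.96 sd.
ci_low, ci_high = mean - 1.96 * sd, mean + 1.96 * sd
print(f"autograder minus human: {mean:.2f} (95% CrI {ci_low:.2f} to {ci_high:.2f})")
```

A credible interval for the coefficient that lies entirely below zero would be read the same way as in the Figure 1 caption: evidence that the autograder assigns lower scores than the human grader. Further predictors, such as the LLM that generated the answer, would enter as additional columns of the design matrix.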

Paper Structure

This paper contains 3 sections, 8 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Illustration of how to use Bayesian GLMs to address \ref{sec:q11} (quantifying the mean difference between scores assigned by an autograder and a human grader) using simulated data. Left panel: Violin plot of simulated scores for LLM-generated answers graded by a human expert (Florence in the example) and an autograder. Right panel: Posterior distributions of estimated effects. The horizontal blue lines represent 95% credible intervals. The dashed red vertical line indicates a null effect ($\beta = 0$). The coefficient for autograder minus human is negative, and its credible interval does not include zero, indicating strong evidence that autograders assign lower scores than human graders.
  • Figure 2: Illustration of how to use Bayesian GLMs to address \ref{sec:q12} (quantifying the mean difference between scores assigned by an autograder and a human grader while evaluating LLMs) using simulated data. Left panel: Violin plot of simulated scores for LLM-generated answers graded by a human expert (Florence in the example) and an autograder. Right panel: Posterior distributions of estimated effects. The horizontal blue lines represent 95% credible intervals. The dashed red vertical line indicates a null effect ($\beta = 0$). The coefficient for autograder minus human is negative, with a credible interval that does not include zero, indicating that the autograder tends to assign lower scores. The coefficient for LLM A minus LLM B is positive, suggesting that LLM A receives higher scores than LLM B on average.
  • Figure 3: Illustration of how to use Bayesian GLMs to address \ref{sec:q2} (Do autograders favour their own generation?) using simulated data. Left panel: Violin plot of simulated scores for LLM-generated answers by two LLMs (LLM A and LLM B). The scores were given by a human expert (green) and two autograders (yellow and red). Right panel: Posterior distributions of estimated effects from the GLM. The horizontal blue lines represent 95% credible intervals, and the dashed red vertical line indicates a null effect ($\beta = 0$). The grader effect ($\beta_1$) shows how each grader deviates from the average score across all graders and LLMs. The LLM effect ($\beta_2$) is positive, indicating that LLM A generally receives higher scores than LLM B. The grader–LLM terms ($\beta_3$) represent a set of parameters (one for each grader–LLM combination) estimated using index-based coding. These parameters are not traditional interaction effects (e.g., a single $\beta_3 \cdot X_{\text{grader}} \cdot X_{\text{LLM}}$ term), but instead allow for direct comparison between specific combinations. The difference in estimated effects for Autograder A on LLM A versus Autograder B on LLM A reveals a non-overlapping contrast, consistent with grader-specific scoring preferences. Together, these results suggest that each autograder favours the LLM it was developed on, indicative of a systematic self-bias.
  • Figure 4: Illustration of how to use Bayesian GLMs to address \ref{sec:q3} (Do autograders differ systematically from human experts?) using simulated data. Left panel: Violin plot of simulated scores for LLM-generated answers from two models (LLM A and LLM B), as graded by multiple human experts (green) and autograders (yellow and red). Right panel: Posterior distributions of estimated effects from the hierarchical model. The horizontal blue lines represent 95% credible intervals, and the dashed red vertical line indicates a null effect ($\beta = 0$). Individual grader effects show how each grader deviates from their respective group-level average (human or autograder). The plotted effect for LLM A minus LLM B represents $2\beta_2$, capturing the full latent score difference between the two models under effect coding. Group-level means for human and autograder graders ($\mu_{\text{graderType}}$) indicate that, on average, human graders assign higher scores than autograders.
  • Figure 5: Illustration of how to use Bayesian GLMs to address \ref{sec:q4} (How do scores differ at an item level?) using simulated data. Left panel: Violin plot of simulated scores for each item (1–4), grouped by LLM and grader identity. Each cell shows the distribution of scores assigned by a given grader to responses from a particular model on a given item. Right panel: Posterior distributions of estimated effects from the item-level GLM (\ref{eq:model_item_grader}). The plot shows main effects for grader and LLM identity (top), item main effects (bottom), and grader–item interactions (middle). Horizontal blue lines represent 95% credible intervals, and the dashed red vertical line indicates a null effect ($\beta = 0$). Item 1 has a strong positive effect, suggesting it consistently receives higher scores. In contrast, Item 4 has a negative effect, indicating that it receives lower scores. Grader–item interaction terms are small and uncertain, indicating no evidence of systematic grader disagreement on specific items.
  • ...and 6 more figures
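
The factor of two in the Figure 4 caption can be seen from a short derivation. It assumes that effect coding assigns $x_{\text{LLM}} = +1$ to LLM A and $x_{\text{LLM}} = -1$ to LLM B; the caption names the coding scheme but not these values, so they are an assumption here. The symbols $\eta$ (the latent score, i.e. the linear predictor) and $c$ (all terms that do not depend on the LLM) are introduced only for this illustration:

```latex
\begin{align}
\eta_{\text{LLM A}} &= c + \beta_2 \cdot (+1) = c + \beta_2, \\
\eta_{\text{LLM B}} &= c + \beta_2 \cdot (-1) = c - \beta_2, \\
\eta_{\text{LLM A}} - \eta_{\text{LLM B}} &= 2\beta_2 .
\end{align}
```

Under this coding, $\beta_2$ alone measures how far each model sits from the midpoint of the two, which is why the plotted LLM A minus LLM B contrast is $2\beta_2$ rather than $\beta_2$.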