Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun

TL;DR

Bridge provides a unified statistical framework that explicitly links human judgments and LLM judgments via a latent human score and covariate-driven LLM deviations. It offers a practical fitting procedure (the logit trick) and asymptotic theory to calibrate LLM judgments and test for systematic human–LLM gaps. The approach, validated on six LLM judges across BigGen Bench and Chatbot Arena, yields improved alignment and probabilistic calibration with limited human labels, while revealing actionable covariates driving divergences. These results support more reliable, transparent evaluation of AI systems at scale and guide targeted bias mitigation and data-collection strategies.

Abstract

Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.

Paper Structure

This paper contains 42 sections, 3 theorems, 49 equations, 9 figures, 19 tables, 1 algorithm.

Key Result

Proposition 3.1

Under Conditions asmp:ident and asmp:hessian (stated in Appendix append:conds), the estimator $(\hat{\eta},\hat{Z}^l_{1:n}) \in\arg\min_{(\eta, z_{1:n})\in\Theta_\eta\times\mathcal{Z}} Q_{n,m_n}(\eta,z_{1:n})$ satisfies $\sqrt{m_n}[(\hat{\eta},\hat{Z}^l_{1:n})-(\eta^*, Z^{l,*}_{1:n})]$ converges in

Figures (9)

  • Figure 1: The logit trick for model fitting. This procedure allows us to fit the statistical model without observing human latent scores $Z^h$. First, the LLM judge rates a pair (prompt, response(s)). Second, we compute/estimate the probability of each score $k$. Third, we process the probabilities, obtaining the LLM scores $Z^l\in {\mathbb{R}}$. Finally, we fit an ordinal logistic regression model for human ratings $Y^h$ given $Z^l$ and covariates $X$ to explain the gap between human and LLM scores.
  • Figure 2: Performance comparison of our proposed methods, logistic-regression baseline, and raw LLM judgments across all datasets. Our methods consistently match or outperform the baselines, notably excelling on BigGen Bench, likely thanks to sensible inductive biases.
  • Figure 3: Covariates $X$ are important. Dots indicate how adding covariates alters the predicted human preference, with colors marking the most influential.
  • Figure 4: Reconstruction loss for BigGen Bench
  • Figure 5: Reconstruction loss for Chatbot Arena
  • ...and 4 more figures
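
The third step in the Figure 1 caption (turning per-score probabilities into a real-valued LLM score $Z^l$) can be illustrated with a small sketch. This is not the paper's estimator: the objective $Q_{n,m_n}$ is not reproduced here, so the code swaps in a plain least-squares fit on cumulative logits. It assumes an ordered-logit form $P(Y^l \le k) = \sigma(\eta_k - z_i)$ and an identification constraint that the scores average to zero. Both are assumptions made for illustration, and the function names `ordinal_probs` and `latent_scores_from_probs` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p, eps=1e-6):
    # Clip away from 0 and 1 so the log stays finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def ordinal_probs(z, eta):
    """Score probabilities under the assumed model P(Y <= k) = sigmoid(eta[k] - z)."""
    cum = [sigmoid(e - z) for e in eta] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


def latent_scores_from_probs(probs):
    """Recover cutpoints eta and latent scores z from judge probabilities.

    probs[i][k] is the judge's probability of the k-th ordered score for item i.
    Least-squares fit of logit P(Y <= k) ~ eta[k] - z[i], with mean(z) = 0
    (an illustrative constraint, not taken from the paper).
    """
    n, num_scores = len(probs), len(probs[0])
    cum_logits = []
    for row in probs:
        total = sum(row)  # renormalize in case the probabilities do not sum to 1
        acc, logits = 0.0, []
        for p in row[:-1]:  # the last cumulative probability is always 1
            acc += p / total
            logits.append(logit(acc))
        cum_logits.append(logits)
    # For a balanced additive model, the least-squares fit is given by means.
    eta = [sum(cum_logits[i][k] for i in range(n)) / n for k in range(num_scores - 1)]
    grand_mean = sum(eta) / (num_scores - 1)
    z = [grand_mean - sum(cum_logits[i]) / (num_scores - 1) for i in range(n)]
    return eta, z


if __name__ == "__main__":
    # Synthetic check: probabilities generated from known values are recovered.
    eta_true = [-1.0, 0.5, 2.0]
    z_true = [-1.0, 0.0, 1.0]
    probs = [ordinal_probs(z, eta_true) for z in z_true]
    eta_hat, z_hat = latent_scores_from_probs(probs)
    print([round(v, 4) for v in eta_hat], [round(v, 4) for v in z_hat])
```

Under this sign convention a larger $z_i$ shifts probability mass toward higher scores. The resulting scores would then enter the final step of the caption, an ordinal logistic regression of $Y^h$ on $Z^l$ and $X$, which is not sketched here.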

Theorems & Definitions (5)

  • Proposition 3.1: Consistency of CoT estimates $\hat{\eta}_k$ and $\hat{Z}_i^l$
  • Theorem 3.2: Asymptotic normality of $(\hat{\beta},\hat{\gamma})$
  • Theorem C.6: Asymptotic normality of $(\hat{\beta},\hat{\gamma})$
  • Proof: Proof of Proposition \ref{prop:stage1_clt}
  • Proof: Proof of Theorem \ref{thm:beta_gamma_normality-ext}