Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

Felipe Maia Polo; Xinhe Wang; Mikhail Yurochkin; Gongjun Xu; Moulinath Banerjee; Yuekai Sun

TL;DR

Bridge provides a unified statistical framework that explicitly links human judgments and LLM judgments via a latent human score and covariate-driven LLM deviations. It offers a practical fitting procedure (the logit trick) and asymptotic theory to calibrate LLM judgments and test for systematic human–LLM gaps. The approach, validated on six LLM judges across BigGen Bench and Chatbot Arena, yields improved alignment and probabilistic calibration with limited human labels, while revealing actionable covariates driving divergences. These results support more reliable, transparent evaluation of AI systems at scale and guide targeted bias mitigation and data-collection strategies.

Abstract

Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
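The modeling idea in the abstract can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions, not the paper's fitting procedure: it assumes a binary human label drawn from a latent score `z`, an LLM judge whose logit equals `z` plus a linear term in a single hypothetical covariate `x` (for example, a standardized response length), and a plain logistic regression of the human label on the LLM logit and the covariate as the recalibration step. The names `simulate`, `fit_logistic`, and `gamma` are invented for this example.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Logistic regression via Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible; it is a numerical safeguard only.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate(n=4000, gamma=0.8, seed=0):
    """Assumed data-generating process (hypothetical, for illustration)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                  # latent human score
    x = rng.normal(size=n)                  # covariate driving the LLM deviation
    y_human = rng.binomial(1, sigmoid(z))   # human label depends on z only
    llm_logit = z + gamma * x               # LLM judge deviates linearly in x
    return x, y_human, llm_logit


def log_loss(y, p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


x, y_human, llm_logit = simulate()

# Regress human labels on the LLM logit plus the covariate.
X = np.column_stack([np.ones_like(x), llm_logit, x])
beta = fit_logistic(X, y_human)

raw_loss = log_loss(y_human, sigmoid(llm_logit))   # LLM judge used as-is
cal_loss = log_loss(y_human, sigmoid(X @ beta))    # after covariate adjustment
```

In this toy setup the human logit equals `llm_logit - gamma * x`, so the fitted coefficient on `x` should come out negative and close to `-gamma`, and the adjusted probabilities should fit the human labels at least as well as the raw LLM logit. A nonzero coefficient on a covariate is the kind of systematic human-LLM gap the abstract refers to; the paper's actual estimator and inference theory may differ from this sketch.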
