Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, Gongjun Xu, Moulinath Banerjee, Yuekai Sun
TL;DR
Bridge is a unified statistical framework that links human and LLM judgments through a latent human score and covariate-driven LLM deviations. It offers a practical fitting procedure (the logit trick) and asymptotic theory for calibrating LLM judgments and testing for systematic human–LLM gaps. Validated with six LLM judges on BigGen Bench and Chatbot Arena, the approach improves alignment and probabilistic calibration using limited human labels and reveals actionable covariates that drive divergences. These results support more reliable, transparent evaluation of AI systems at scale and can guide targeted bias mitigation and data-collection strategies.
Abstract
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings, as measured by accuracy, calibration, and KL divergence, and exposes systematic human–LLM gaps.
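To make the modeling idea concrete, the following is a minimal sketch on synthetic pairwise-comparison data, not the paper's implementation. It assumes a single covariate, a logistic link for human labels, and an LLM latent score equal to the human score plus a linear term in the covariate. Using the logit of the LLM's reported probability as a stand-in for its latent score is an assumption made here for illustration and may differ from the paper's exact logit trick. All names (`GAMMA`, `simulate`, `fit`, `features`) are hypothetical.

```python
import math
import random

GAMMA = 1.0  # hypothetical strength of the LLM's covariate-driven deviation


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-6):
    """Inverse of sigmoid, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def simulate(n, seed=0):
    """Synthetic comparisons: (LLM probability, covariate, human label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        human_score = rng.gauss(0.0, 1.0)    # latent human preference score
        x = rng.gauss(0.0, 1.0)              # covariate, e.g. a length difference
        llm_score = human_score + GAMMA * x  # LLM deviates linearly in x (assumed)
        y = 1 if rng.random() < sigmoid(human_score) else 0
        rows.append((sigmoid(llm_score), x, y))
    return rows


def features(p_llm, x):
    """Intercept, LLM logit, and covariate."""
    return (1.0, logit(p_llm), x)


def fit(rows, steps=400, lr=0.5):
    """Logistic regression of human labels on the features, by gradient descent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for p_llm, x, y in rows:
            f = features(p_llm, x)
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f))) - y
            for j in range(3):
                grad[j] += err * f[j]
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def log_loss(probs, labels):
    """Mean negative log-likelihood of binary labels."""
    return -sum(math.log(p if y else 1.0 - p) for p, y in zip(probs, labels)) / len(labels)


rows = simulate(1500)
labels = [y for _, _, y in rows]
coef = fit(rows)
raw_loss = log_loss([p for p, _, _ in rows], labels)
fit_loss = log_loss(
    [sigmoid(sum(wi * fi for wi, fi in zip(coef, features(p, x)))) for p, x, _ in rows],
    labels,
)
print("intercept, logit weight, covariate weight:", [round(c, 2) for c in coef])
print("log-loss, raw LLM vs. adjusted:", round(raw_loss, 3), round(fit_loss, 3))
```

In this simulation the data are generated so that the human score equals the LLM logit minus `GAMMA` times the covariate, so the fitted logit weight should land near 1 and the covariate weight near `-GAMMA`. Under this simplified reading, the sign and size of the covariate weight indicate the direction and magnitude of the human–LLM gap, and the adjusted probabilities fit the human labels better than the raw LLM probabilities do.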
