Quantitative LLM Judges
Aishwarya Sahoo, Jeevana Kruthi Karnuthala, Tushar Parmanand Budhwani, Pranchal Agarwal, Sankaran Vaidyanathan, Alexa Siu, Franck Dernoncourt, Jennifer Healey, Nedim Lipka, Ryan Rossi, Uttaran Bhattacharya, Branislav Kveton
TL;DR
This paper tackles the miscalibration and instability of LLM-based evaluation by decoupling qualitative reasoning from quantitative scoring. It introduces quantitative LLM judges, a family of generalized linear models that take a frozen base judge's rationale and score as input and calibrate them post hoc to human judgments, for both absolute and relative feedback. Four variants (Least-Squares, Multinomial, Bradley-Terry-Luce, and Two-Headed BTL) map base-judge outputs to domain-specific human scores using lightweight predictors, improving predictive power with substantially less computation than fine-tuning. Experiments on four datasets with two base judges show consistent improvements in rating and preference prediction, highlighting the approach's data efficiency, scalability, and practical value for domain-calibrated LLM evaluation.
Abstract
LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align the evaluation scores of existing LLM judges with human scores in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, showcasing the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, as is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
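To make the post-hoc idea concrete, the following is a minimal sketch of what a least-squares style quantitative judge could look like. It is an illustration under stated assumptions, not the authors' implementation: the rationale embeddings, the synthetic data, the ridge penalty, and the helper names `fit_ls_judge` and `predict_ls_judge` are all hypothetical. The base judge stays frozen, and only a small linear model over its rationale embedding and score is fit to human scores.

```python
import numpy as np

def build_features(phi, base_scores):
    # Concatenate rationale embedding, base judge score, and a bias term.
    return np.hstack([phi, base_scores[:, None], np.ones((len(phi), 1))])

def fit_ls_judge(phi, base_scores, human_scores, ridge=1e-2):
    # Ridge-regularized least squares (closed form). The ridge term is an
    # assumption added here for numerical stability.
    X = build_features(phi, base_scores)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ human_scores)

def predict_ls_judge(theta, phi, base_scores):
    # Calibrated score: a linear function of the frozen judge's outputs.
    return build_features(phi, base_scores) @ theta

# Synthetic stand-in data (hypothetical): in practice phi would come from
# embedding the base judge's textual rationale.
rng = np.random.default_rng(0)
n, d = 300, 8
phi = rng.normal(size=(n, d))
base = rng.integers(1, 6, size=n).astype(float)   # base judge scores in 1..5
w_true = 0.3 * rng.normal(size=d)
human = 1.0 + 0.5 * base + phi @ w_true + rng.normal(scale=0.1, size=n)

train, test = slice(0, 200), slice(200, n)
theta = fit_ls_judge(phi[train], base[train], human[train])
pred = predict_ls_judge(theta, phi[test], base[test])

mse_base = float(np.mean((base[test] - human[test]) ** 2))
mse_cal = float(np.mean((pred - human[test]) ** 2))
print(f"base judge MSE: {mse_base:.3f}, calibrated MSE: {mse_cal:.3f}")
```

Because only a small linear model is trained, fitting reduces to solving one linear system, which is the kind of computational saving over supervised fine-tuning that the abstract refers to. The improvement seen on this synthetic data follows from how the data was generated and says nothing about real benchmarks.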
