LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie
TL;DR
The paper presents LLM-Rubric, a framework that prompts an LLM with each question in a manually authored rubric to obtain a distribution over responses for every evaluation dimension of a text, then calibrates these distributions with a small network, containing both judge-specific and judge-independent parameters, to predict individual human judgments, including overall satisfaction. By modeling full distributions across rubric questions and optimizing for likelihood, the method produces calibrated predictions that outperform uncalibrated LLM scores and several baselines in information-seeking dialogue evaluation. The approach is data-efficient and improves on these baselines for both synthetic and real dialogues, and the paper analyzes which rubric dimensions drive accuracy and how the framework can be extended. The work also discusses practical applications, robustness, and ethical considerations for deploying automated rubric-based evaluation at scale.
Abstract
This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
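
The calibration step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that a small feed-forward network with judge-specific and judge-independent parameters combines the LLM distributions. The sketch assumes that every rubric question is answered on a 1--4 scale, that judge-specific parameters enter as additive offsets to a shared weight matrix, and that a point prediction is the mean of the predicted distribution. The names `CalibrationNet`, `expected_score`, `nll`, and `rmse`, and the hidden width, are hypothetical. Weights are random here; in practice they would be fit by minimizing the negative log-likelihood of human annotations.

```python
import math
import random

NUM_QUESTIONS = 9   # rubric questions, as in the abstract
NUM_RESPONSES = 4   # assumption: every question uses a 1-4 response scale
HIDDEN = 8          # hypothetical hidden-layer width


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random Gaussian matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def affine(weights, bias, x):
    """Compute weights @ x + bias for list-based matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class CalibrationNet:
    """Toy feed-forward calibrator with shared and per-judge parameters."""

    def __init__(self, num_judges, seed=0):
        rng = random.Random(seed)
        in_dim = NUM_QUESTIONS * NUM_RESPONSES
        # Judge-independent parameters.
        self.w_shared = rand_matrix(HIDDEN, in_dim, rng)
        self.b_hidden = [0.0] * HIDDEN
        self.w_out = rand_matrix(NUM_RESPONSES, HIDDEN, rng)
        self.b_out = [0.0] * NUM_RESPONSES
        # Judge-specific parameters (assumption: additive weight offsets).
        self.w_judge = [rand_matrix(HIDDEN, in_dim, rng)
                        for _ in range(num_judges)]

    def predict_distribution(self, llm_dists, judge):
        """Map the LLM's per-question distributions to a distribution
        over this judge's answer to the summary question."""
        x = [p for dist in llm_dists for p in dist]  # flatten to 36 features
        w = [[s + j for s, j in zip(row_s, row_j)]
             for row_s, row_j in zip(self.w_shared, self.w_judge[judge])]
        hidden = [math.tanh(h) for h in affine(w, self.b_hidden, x)]
        return softmax(affine(self.w_out, self.b_out, hidden))


def expected_score(dist):
    """Mean of a distribution over the scores 1..len(dist)."""
    return sum((k + 1) * p for k, p in enumerate(dist))


def nll(dist, label):
    """Negative log-likelihood of a human label in 1..len(dist)."""
    return -math.log(dist[label - 1])


def rmse(predictions, targets):
    """Root-mean-square error between predicted and human scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets))
                     / len(targets))


# Usage: fake LLM distributions for one dialogue, scored for judge 1.
demo_rng = random.Random(1)
llm_dists = [softmax([demo_rng.gauss(0.0, 1.0) for _ in range(NUM_RESPONSES)])
             for _ in range(NUM_QUESTIONS)]
net = CalibrationNet(num_judges=3)
pred = net.predict_distribution(llm_dists, judge=1)
print(pred, expected_score(pred), nll(pred, 3))
```

Because the output is a full distribution rather than a single score, the same forward pass supports both likelihood-based training (via `nll`) and the RMS error reported on the 1--4 satisfaction scale (via `expected_score` and `rmse`).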
