LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Helia Hashemi; Jason Eisner; Corby Rosset; Benjamin Van Durme; Chris Kedzie

TL;DR

The paper presents LLM-Rubric, a framework that uses a manually authored evaluation rubric to elicit multidimensional LLM responses per text, then calibrates these responses with a judge-specific network to predict individual human judgments, including overall satisfaction. By modeling full distributions across rubric questions and optimizing for likelihood, the method achieves calibrated, high-fidelity predictions that outperform uncalibrated LLM scores and several baselines in information-seeking dialogue evaluation. The approach demonstrates strong calibration, data-efficiency, and notable improvements over baselines on both synthetic and real dialogues, with insights into which rubric dimensions drive accuracy and how to extend the framework. The work also discusses practical applications, robustness, and ethical considerations for deploying automated rubric-based evaluation at scale.

Abstract

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
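The combination step described in the abstract can be illustrated with a minimal sketch. It assumes 9 rubric questions with 4 response options each, and it shares parameters by adding a judge-specific offset to each judge-independent weight matrix. That sharing scheme, the layer sizes, the `tanh` activation, and every name below are assumptions of this sketch, not details stated in the abstract. The weights are random and untrained, so the outputs only show the shapes and the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 9 rubric questions, 4 response options, 3 judges.
N_QUESTIONS, N_OPTIONS, HIDDEN, N_JUDGES = 9, 4, 16, 3
IN_DIM = N_QUESTIONS * N_OPTIONS


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


# Judge-independent parameters (shared by all judges).
W_shared = rng.normal(scale=0.1, size=(HIDDEN, IN_DIM))
V_shared = rng.normal(scale=0.1, size=(N_OPTIONS, HIDDEN))
# Judge-specific offsets, one set per judge (assumed additive).
W_judge = rng.normal(scale=0.01, size=(N_JUDGES, HIDDEN, IN_DIM))
V_judge = rng.normal(scale=0.01, size=(N_JUDGES, N_OPTIONS, HIDDEN))


def predict_distribution(llm_dists, judge):
    """Map the LLM's per-question distributions to one judge's
    predicted distribution over the overall-satisfaction scores 1..4."""
    x = llm_dists.reshape(-1)  # concatenate the 9 distributions
    h = np.tanh((W_shared + W_judge[judge]) @ x)
    return softmax((V_shared + V_judge[judge]) @ h)


def expected_score(p):
    """Point prediction: the mean of the predicted distribution."""
    return float(p @ np.arange(1, N_OPTIONS + 1))


def rmse(pred, gold):
    """Root-mean-square error between predicted and human scores."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))


# Stand-in for the LLM output: one distribution per rubric question.
llm_dists = rng.dirichlet(np.ones(N_OPTIONS), size=N_QUESTIONS)
p = predict_distribution(llm_dists, judge=1)
print(p, expected_score(p))
```

In a trained system, the shared and judge-specific weights would be fit to maximize the likelihood of the human annotations, and `rmse` would compare `expected_score` outputs against held-out human scores on the 1--4 scale.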

Paper Structure (66 sections, 4 equations, 5 figures, 6 tables)

Introduction
The LLM-Rubric Method
Evaluation Rubric Construction.
Multi-Dimensional Evaluation with LLMs.
Aggregated Evaluation with Personalized Calibration.
Decoding.
Calibration Network Architecture.
Multi-Task Learning.
Using the Predictions.
Future Extensions.
Data
Mining Topics for RAG
Synthetic Dialogue Generation
Real Dialogue Collection and Evaluation
Experiments
...and 51 more sections

Figures (5)

Figure 1: An overview of the LLM-Rubric framework. The LLM and its prompts are fixed across texts and judges, but the calibration network weights are trained to predict the responses of various human judges.
Figure 2: Our calibration network learns how different human judges use the response range 1--4. Each black curve shows a different judge's distribution of responses to the "overall satisfaction" question $Q_0$ on our synthetic conversation dataset. (We show the judges who evaluated $\geq 30$ conversations.) The corresponding gray curve shows the average distribution predicted for that judge on the same dialogues by LLM-Rubric (using cross-validation). The final curve in light gray shows the original uncalibrated distribution of responses to $Q_0$ by the LLM (gpt-3.5-turbo-16k).
Figure 3: User interface for real dialogue collection and evaluation.
Figure 4: Learning curve for training the personalized calibration network in LLM-Rubric on synthetic conversations and testing on the real conversation data. The model's performance becomes relatively stable after observing 80% of the training data. Note that the LLM itself is not fine-tuned to predict any judge's responses.
Figure 5: Calibration plots for $Q_0$ on held-out synthetic dialogues, as explained in the analysis section and the calibration appendix. These are plots for $y_0 \in \{1,2,3,4\}$ respectively. They show low calibration error.
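As a rough illustration of what a calibration plot such as Figure 5 summarizes, the sketch below bins the predicted probability of one score value and compares each bin's mean prediction with the observed frequency of that score. The equal-width binning, the bin count, and the function names are assumptions of this sketch. The paper's own procedure may differ.

```python
import numpy as np


def reliability_bins(probs, labels, target, n_bins=10):
    """For one score value `target`, group examples by predicted
    probability and return (mean predicted prob, observed freq, count)
    for every non-empty bin."""
    probs = np.asarray(probs, float)
    hits = (np.asarray(labels) == target).astype(float)
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(probs[mask].mean()),
                         float(hits[mask].mean()),
                         int(mask.sum())))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between predicted prob and observed freq."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(conf - freq) for conf, freq, n in rows) / total


# Toy data: predicted P(y_0 = 4) for four dialogues and the judges' actual scores.
rows = reliability_bins([0.15, 0.15, 0.85, 0.85], [1, 2, 4, 4], target=4)
print(rows, expected_calibration_error(rows))
```

Plotting the first two entries of each row against each other gives one panel of a calibration plot. Points near the diagonal indicate that predicted probabilities match observed frequencies.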
