Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Clemencia Siro, Pourya Aliannejadi, Mohammad Aliannejadi

TL;DR

This work introduces GER-Eval, a two-stage framework that lets LLMs both generate and apply their own evaluation rubrics, then assesses how these self-defined rubrics align with human criteria and transfer across models. The approach formalizes rubric generation and scoring with notation such as $M = \{m_i\}$, $m_i = (n_i, d_i, s_i)$, and scores $s_{i,j}$, enabling controlled analyses across tasks, prompts, and model families. Across four NLP benchmarks, LLMs produce coherent, task-relevant criteria and apply them with high internal consistency, but cross-model generalization and human alignment are stronger in conversational or instruction-following domains than in factual or biomedical domains. Closed-source GPT-4o tends to outperform open-weight models in rubric quality and alignment, highlighting a reliability-validity trade-off and the need for collaborative human–LLM evaluation frameworks that couple linguistic calibration with domain grounding.

Abstract

Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.

Paper Structure (31 sections, 5 equations, 9 figures, 13 tables, 1 algorithm)

Figures (9)

  • Figure 1: Task-specific counts by model across datasets. Each color represents a model; lighter bars denote Total counts, and darker hatched bars denote Task-specific rubrics.
  • Figure 2: Percentage of task-specific rubrics identified by each model under three prompting conditions.
  • Figure 3: Agreement with 95% confidence intervals across datasets: (a) USR, (b) HelpSteer2, (c) SummEval, and (d) SumPubMed, between human and LLM scores when using human-defined rubrics.
  • Figure 4: Spearman correlations for two datasets (USR and SumPubMed) and two rubric sources (GPT-4o and Llama).
  • Figure 5: Rubric frequencies across all datasets, conditions, and models, including their overlap with human rubrics.
  • ...and 4 more figures