Learning to Judge: LLMs Designing and Applying Evaluation Rubrics
Clemencia Siro, Pourya Aliannejadi, Mohammad Aliannejadi
TL;DR
This work introduces GER-Eval, a two-stage framework in which LLMs both generate and apply their own evaluation rubrics; it then assesses how well these self-defined rubrics align with human criteria and transfer across models. The approach formalizes rubric generation and scoring with notation such as $M = \{m_i\}$, $m_i = (n_i, d_i, s_i)$, and scores $s_{i,j}$, enabling controlled analyses across tasks, prompts, and model families. Across four NLP benchmarks, LLMs produce coherent, task-relevant criteria and apply them with high internal consistency, but cross-model generalization and human alignment are stronger in conversational and instruction-following domains than in factual or biomedical ones. The closed-source GPT-4o tends to outperform open-weight models in rubric quality and alignment. Together, these results highlight a reliability–validity trade-off and the need for collaborative human–LLM evaluation frameworks that couple linguistic calibration with domain grounding.
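To make the notation concrete, the sketch below shows one possible way to represent a rubric and a score table in Python. It is an illustration only, not the authors' implementation. The reading of $n_i$, $d_i$, and $s_i$ as a criterion's name, description, and score scale is an assumption, as is reading $s_{i,j}$ as the score of criterion $i$ on output $j$. The names `Criterion`, `apply_rubric`, and `toy_scorer` are hypothetical, and `toy_scorer` is a deterministic stand-in for an LLM judge. Only the scoring stage is sketched; rubric generation is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric entry m_i = (n_i, d_i, s_i); field meanings are assumed."""
    name: str               # n_i (assumed: criterion name)
    description: str        # d_i (assumed: what the criterion measures)
    scale: Tuple[int, int]  # s_i (assumed: inclusive min and max score)


Rubric = List[Criterion]    # M = {m_i}
Scorer = Callable[[Criterion, str], int]


def apply_rubric(
    rubric: Rubric, outputs: List[str], scorer: Scorer
) -> Dict[Tuple[int, int], int]:
    """Return scores[(i, j)] = s_{i,j}, clamped to criterion i's scale."""
    scores: Dict[Tuple[int, int], int] = {}
    for i, criterion in enumerate(rubric):
        low, high = criterion.scale
        for j, text in enumerate(outputs):
            raw = scorer(criterion, text)
            scores[(i, j)] = max(low, min(high, raw))
    return scores


def toy_scorer(criterion: Criterion, text: str) -> int:
    """Hypothetical stand-in for an LLM judge: one point per three words."""
    return len(text.split()) // 3


rubric = [
    Criterion("fluency", "Is the response grammatical and readable?", (1, 5)),
    Criterion("relevance", "Does the response address the request?", (1, 5)),
]
outputs = ["Paris is the capital of France.", "Yes."]
scores = apply_rubric(rubric, outputs, toy_scorer)
```

In a real setting the scorer would wrap a model call that receives the criterion text and the output to be judged. Keeping the scorer as a plain callable is a design choice that lets the same rubric be applied by different models, which is the kind of cross-model comparison the summary describes.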
Abstract
Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
