CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, Najoung Kim

TL;DR

This work addresses the reliability shortcomings of LLM-based text evaluation by introducing CheckEval, a three-stage framework that decomposes evaluation into fine-grained, binary yes/no checklist questions. Seed questions are expanded through two independent augmentation methods (diversification and elaboration) and then pruned through filtering to keep the checklist aligned with benchmark objectives. Evaluations on SummEval and Topical-Chat with 12 evaluator models show that CheckEval achieves higher correlation with human judgments and significantly better inter-evaluator agreement, while its traceable checklist responses make the scores more interpretable. The results position CheckEval as a scalable, reliable, and explainable alternative to Likert-scale LLM evaluators for NLG tasks; future directions include automated checklist design and broader task applicability.

Abstract

Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert-scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores are furthermore more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, allowing analyses of the specific attributes that drive quality judgments.
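
As a rough illustration of how decomposed binary questions can yield a traceable score, the sketch below aggregates yes/no answers for a single evaluated text. It is not the authors' implementation: the aggregation rule (fraction of "yes" answers), the helper names `parse_answer`, `checklist_score`, and `failed_questions`, and the example questions are all assumptions made for illustration.

```python
from typing import Dict, List


def parse_answer(raw: str) -> bool:
    """Map a raw evaluator reply to a binary decision.

    Assumption: any reply that does not start with "yes" counts as "no".
    """
    return raw.strip().lower().startswith("yes")


def checklist_score(answers: Dict[str, str]) -> float:
    """Return the fraction of checklist questions answered "yes".

    Assumption: an unweighted mean is used as the aggregation rule.
    """
    if not answers:
        raise ValueError("checklist is empty")
    decisions = [parse_answer(reply) for reply in answers.values()]
    return sum(decisions) / len(decisions)


def failed_questions(answers: Dict[str, str]) -> List[str]:
    """List the questions answered "no", so a low score can be traced."""
    return [q for q, reply in answers.items() if not parse_answer(reply)]


# Hypothetical checklist responses for one summary (illustrative only).
example_answers = {
    "Does the summary state the main event of the article?": "Yes",
    "Are all named entities in the summary also in the article?": "yes.",
    "Is the summary free of contradictions with the article?": "No",
    "Does the summary avoid repeating the same fact?": "Yes",
}

print(checklist_score(example_answers))   # 0.75
print(failed_questions(example_answers))  # the single question answered "No"
```

Because each question is a separate binary decision, a score of 0.75 here can be traced to the one failed question, which is the kind of attribute-level analysis the abstract describes.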

Paper Structure

This paper contains 51 sections, 9 figures, and 27 tables.

Figures (9)

  • Figure 1: Overall process of CheckEval. CheckEval consists of three stages: (1) Defining Dimensions of Evaluation, where humans select specific dimensions and define sub-dimensions; (2) Checklist Generation, which incorporates two augmentation methods—question diversification (green) and elaboration (blue); and (3) Checklist-based Evaluation, where the model responds to the checklist with yes/no answers.
  • Figure 2: Human validation scores for the checklist generation process, averaged across all dimensions on both SummEval and Topical-Chat. 'Augmentation' refers to the percentage of augmented questions that fulfilled the specified quality criteria, and 'Filtering' refers to the percentage for filtered questions.
  • Figure 3: Kernel density estimation (KDE) of correlations with human judgments for G-Eval (purple), SEEval (blue) and CheckEval (pink) across different evaluator models on SummEval and Topical-Chat. Dashed lines indicate mean correlation values.
  • Figure 4: Dimension-wise correlation analysis of G-Eval (purple) and CheckEval (pink), with samples divided based on human annotator ratings into High-Quality (human ratings $\geq 3$) and Low-Quality (human ratings $< 3$) groups. Each bar represents correlation with human judgments across different quality dimensions.
  • Figure 5: Evaluation Prompt - SummEval
  • ...and 4 more figures