Check-Eval: A Checklist-based Approach for Evaluating Text Quality

Jayr Pereira, Andre Assumpcao, Roberto Lotufo

TL;DR

This work introduces Check-Eval, a checklist-based evaluation framework that leverages large language models to assess text quality in a structured, interpretable way. It supports reference-guided, candidate-guided, and criterion-guided modes, generating checklists from sources or explicit criteria and scoring candidates by key-point coverage. Across Portuguese Legal Semantic Textual Similarity and SummEval, Check-Eval shows higher alignment with human judgments than strong baselines like GPTScore and G-Eval, while also providing actionable feedback on specific improvements. The approach offers a scalable, interpretable alternative for NLG evaluation with potential for broader task applicability.

Abstract

Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose Check-Eval, a novel evaluation framework leveraging LLMs to assess the quality of generated text through a checklist-based approach. Check-Eval can be employed as both a reference-free and reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval. Our results demonstrate that Check-Eval achieves higher correlations with human judgments compared to existing metrics, such as G-Eval and GPTScore, underscoring its potential as a more reliable and effective evaluation framework for natural language generation tasks. The code for our experiments is available at https://anonymous.4open.science/r/check-eval-0DB4.
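To make the two stages concrete, the following is a minimal sketch of how a checklist could be turned into a score. It is not the authors' implementation: the function names `coverage_score` and `keyword_judge` are hypothetical, the checklist is written by hand rather than generated by an LLM, and the assumption that the score is the fraction of checklist items covered by the candidate is one plausible reading of "key-point coverage", not a formula stated in the paper.

```python
from typing import Callable, List


def coverage_score(
    checklist: List[str],
    candidate: str,
    is_covered: Callable[[str, str], bool],
) -> float:
    """Return the fraction of checklist items that the candidate covers.

    Assumption: the final score is a simple coverage ratio. The judge
    `is_covered` stands in for the LLM call of the evaluation stage.
    """
    if not checklist:
        raise ValueError("checklist must contain at least one item")
    hits = sum(1 for item in checklist if is_covered(item, candidate))
    return hits / len(checklist)


def keyword_judge(item: str, candidate: str) -> bool:
    """Toy stand-in for the LLM judgment: case-insensitive substring match."""
    return item.lower() in candidate.lower()


# Stage 1 (checklist generation) is replaced here by a hand-written checklist.
checklist = [
    "sea levels are rising",
    "emissions increased",
    "policy response is needed",
]

# Stage 2 (checklist evaluation): check each item against the candidate text.
candidate = (
    "The report says sea levels are rising and emissions increased "
    "over the last decade."
)

score = coverage_score(checklist, candidate, keyword_judge)
missing = [item for item in checklist if not keyword_judge(item, candidate)]
print(f"coverage: {score:.2f}")
print("missing items:", missing)
```

The list of missing items illustrates the kind of actionable feedback a checklist makes possible: each unmet item points to a specific improvement. In a real setup the judge would be an LLM prompt rather than a substring match.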

Paper Structure (21 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of the Check-Eval methodology.
  • Figure 2: Prompt used to generate the checklist from a source text. The blue text is the definition of the evaluation criteria, which is a variable that can be changed according to the desired evaluation criteria (e.g., consistency, coherence, relevance and fluency).
  • Figure 3: Example of a generated checklist based on a source document about climate change. The checklist aims to capture the key points of the source document and serves as a reference for evaluating the candidate summary.
  • Figure 4: Prompt used to evaluate a candidate summary based on the generated checklist. The checklist is specific to the evaluation criteria, in this case, consistency.