CoAScore: Chain-of-Aspects Prompting for NLG Evaluation

Peiyuan Gong, Jiaxin Mao

TL;DR

CoAScore reframes NLG evaluation as aspect-specific scoring that leverages cross-aspect knowledge through a chain-of-aspects prompting framework. It first generates a chain of relevant aspects for a target facet, pre-scores each aspect, and then synthesizes these insights to evaluate the target aspect with a final Chain-of-Aspects Scoring step. Formulated both in reference-based and reference-free forms, the approach demonstrates higher correlation with human judgments than rule-based, machine-learned, and other LLM-based metrics across five NLG tasks and nine aspects, with benefits growing as more relevant aspects are incorporated. The work includes thorough ablations and case studies to validate the necessity of each stage and the usefulness of LLM-generated relevant aspects, offering a robust, interpretable, and scalable framework for multi-aspect NLG evaluation. The authors also release code and scripts to facilitate adoption and further research in NLG evaluation.

Abstract

Recently, natural language generation (NLG) evaluation has shifted from a single-aspect to a multi-aspect paradigm, allowing for a more accurate assessment. Large language models (LLMs) achieve superior performance on various NLG evaluation tasks. However, current work often employs the LLM to evaluate different aspects independently, which largely ignores the rich correlation between aspects. To fill this research gap, we propose an NLG evaluation metric called CoAScore. Powered by LLMs, CoAScore utilizes multi-aspect knowledge through a CoA (Chain-of-Aspects) prompting framework when assessing the quality of a certain aspect. Specifically, for a given aspect to evaluate, we first prompt the LLM to generate a chain of aspects that are relevant to the target aspect and could be useful for the evaluation. We then collect evaluation scores for each generated aspect and, finally, leverage the knowledge of these aspects to improve the evaluation of the target aspect. We evaluate CoAScore across five NLG evaluation tasks (e.g., summarization and dialog response generation) and nine aspects (e.g., overall quality, relevance, and coherence). Our experimental findings show that, compared with individual aspect evaluation, CoAScore exhibits a higher correlation with human judgments. With this improvement, CoAScore significantly outperforms existing unsupervised evaluation metrics, whether for assessing overall quality or other aspects. We also conducted extensive ablation studies to validate the effectiveness of the three stages within the CoAScore framework, and case studies to show how the LLM performs in these stages. Our code and scripts are available.
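The three stages described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' released code: the function names (`generate_aspects`, `score_aspect`, `coascore`), the prompt wording, the numeric-only answer format, and the `fake_llm` stand-in are all assumptions made for the example. It covers the reference-free setting only, and the actual prompts are those given in the paper's Appendix A.

```python
from typing import Callable, Dict, List, Optional

# Assumption: an "LLM" is any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


def generate_aspects(llm: LLM, task: str, target: str, n: int = 5) -> List[str]:
    """Stage I: ask the LLM for a chain of aspects relevant to the target."""
    prompt = (f"{task}\nList {n} aspects relevant to evaluating "
              f"'{target}', one per line.")
    lines = (line.strip(" -\t") for line in llm(prompt).splitlines())
    return [line for line in lines if line][:n]


def score_aspect(llm: LLM, task: str, aspect: str, source: str,
                 hypothesis: str,
                 known: Optional[Dict[str, float]] = None) -> float:
    """Score one aspect, optionally conditioning on scores of other aspects."""
    hints = ""
    if known:
        listed = "\n".join(f"- {name}: {value}" for name, value in known.items())
        hints = f"Scores already given for related aspects:\n{listed}\n"
    prompt = (f"{task}\nSource: {source}\nHypothesis: {hypothesis}\n{hints}"
              f"Rate the hypothesis on '{aspect}'. Answer with a number only.")
    return float(llm(prompt).strip())


def coascore(llm: LLM, task: str, target: str, source: str,
             hypothesis: str, n: int = 5) -> float:
    aspects = generate_aspects(llm, task, target, n)            # Stage I
    pre_scores = {a: score_aspect(llm, task, a, source, hypothesis)
                  for a in aspects}                             # Stage II
    return score_aspect(llm, task, target, source, hypothesis,
                        known=pre_scores)                       # Stage III


def fake_llm(prompt: str) -> str:
    """Hypothetical deterministic stand-in for a real model, for testing only."""
    if "one per line" in prompt:
        return "fluency\nrelevance\nengagement"
    return "4" if "Scores already given" in prompt else "3"


final = coascore(fake_llm, "Evaluate a dialogue response.", "overall quality",
                 "Hi, how are you?", "I am fine, thanks.")
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a real run would replace `fake_llm` with a function that calls an actual model and would need to handle completions that are not clean numbers.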

Paper Structure

This paper contains 30 sections, 4 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: The overall prompting framework of CoAScore. Given the evaluation task instruction $\boldsymbol{t}$, evaluation aspect $a$, source $\boldsymbol{s}$ and hypothesis $\boldsymbol{h}$, CoAScore needs to measure the quality of the hypothesis in that aspect. CoAScore consists of three distinct stages, each carried out by an LLM: (I) Generating a chain of aspects that will be used as references when evaluating the target aspect. These generated aspects are chosen to be closely related to the target aspect; (II) Scoring each of the generated aspects for the hypothesis; (III) Leveraging the knowledge about the chain of relevant aspects to enhance the evaluation capability for the specific target aspect. Some detailed information, such as conversation context and replies, has been omitted in the prompts and can be found in Appendix A.
  • Figure 2: Effectiveness of the Relevant Aspect Scoring stage. Owing to the absence of reference scores, the performance of CoAScore$_{w/o\,score}$ falls behind that of LLMScore. Furthermore, assigning random scores to relevant aspects seriously distorts the evaluation of a specific aspect, resulting in the weakest correlation scores for CoAScore$_{random}$.
  • Figure 3: Examples of Relevant Aspect Generation in evaluating the overall quality of dialogue responses and the coherence of summaries. Each one provides five relevant aspects as references to help the target aspect evaluation.
  • Figure 4: Effectiveness of various relevant aspect numbers. As the number of relevant aspects increases, the correlation scores of CoAScore generally improve and are always better than those of LLMScore.
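
The "correlation scores" in the captions above compare metric scores with human judgments over a set of hypotheses. As a rough sketch of how such a number could be computed, the snippet below implements Spearman rank correlation in plain Python. The choice of Spearman is an assumption (the text here does not name the coefficient used), and the helper names and sample values are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only, not data from the paper.
metric_scores = [3.0, 4.0, 2.0, 5.0]
human_scores = [2.5, 4.5, 1.0, 4.0]
rho = spearman(metric_scores, human_scores)
```

A metric whose scores rank the hypotheses in exactly the same order as the human ratings would obtain a correlation of 1 under this measure.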