CoAScore: Chain-of-Aspects Prompting for NLG Evaluation
Peiyuan Gong, Jiaxin Mao
TL;DR
CoAScore reframes NLG evaluation as aspect-specific scoring that draws on cross-aspect knowledge through a chain-of-aspects prompting framework. It first generates a chain of aspects relevant to the target aspect, pre-scores each of them, and then uses these scores to evaluate the target aspect in a final Chain-of-Aspects Scoring step. Formulated in both reference-based and reference-free forms, the approach shows higher correlation with human judgments than rule-based, machine-learned, and other LLM-based metrics across five NLG tasks and nine aspects, with benefits growing as more relevant aspects are incorporated. The work includes ablations and case studies that validate the necessity of each stage and the usefulness of LLM-generated relevant aspects. The authors also release code and scripts to support adoption and further research in NLG evaluation.
Abstract
Recently, natural language generation (NLG) evaluation has shifted from a single-aspect to a multi-aspect paradigm, allowing for a more accurate assessment. Large language models (LLMs) achieve superior performance on various NLG evaluation tasks. However, current work often employs the LLM to evaluate different aspects independently, which largely ignores the rich correlation between aspects. To fill this research gap, we propose an NLG evaluation metric called CoAScore. Powered by LLMs, CoAScore utilizes multi-aspect knowledge through a CoA (Chain-of-Aspects) prompting framework when assessing the quality of a certain aspect. Specifically, for a given aspect to evaluate, we first prompt the LLM to generate a chain of aspects that are relevant to the target aspect and could be useful for the evaluation. We then collect evaluation scores for each generated aspect and, finally, leverage the knowledge of these aspects to improve the evaluation of the target aspect. We evaluate CoAScore across five NLG evaluation tasks (e.g., summarization and dialog response generation) and nine aspects (e.g., overall quality, relevance, and coherence). Our experimental findings show that, compared with individual aspect evaluation, CoAScore exhibits a higher correlation with human judgments. With this improvement, CoAScore significantly outperforms existing unsupervised evaluation metrics, both on overall quality and on other aspects. We also conduct extensive ablation studies to validate the effectiveness of the three stages within the CoAScore framework, and case studies to show how the LLM performs in these stages. Our code and scripts are available.
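The three stages described above (generate a chain of relevant aspects, pre-score each one, then score the target aspect with those scores as context) can be illustrated with a minimal sketch. This is not the authors' released code: the prompt wording, the 1-to-5 scale, the function names, and the `fake_llm` stand-in are all assumptions made for illustration, and the paper's actual scoring mechanism may differ from asking the model for a number directly.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: an "LLM" is any callable that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def generate_aspect_chain(llm: LLM, target_aspect: str, n_aspects: int) -> List[str]:
    """Stage 1: ask the model for aspects relevant to the target aspect."""
    prompt = (
        f"List {n_aspects} aspects relevant to evaluating "
        f"'{target_aspect}', one per line."
    )
    lines = [line.strip(" -\t") for line in llm(prompt).splitlines()]
    return [line for line in lines if line][:n_aspects]


def score_aspect(llm: LLM, text: str, aspect: str, context: str = "") -> float:
    """Score one aspect; `context` optionally carries scores of related aspects."""
    prompt = (
        f"{context}Rate the {aspect} of the following text from 1 to 5. "
        f"Reply with a number only.\nText: {text}\nScore:"
    )
    return float(llm(prompt).strip())


def coa_score(
    llm: LLM, text: str, target_aspect: str, n_aspects: int = 3
) -> Tuple[float, Dict[str, float]]:
    """Run the three stages and return (target score, pre-scores)."""
    chain = generate_aspect_chain(llm, target_aspect, n_aspects)
    # Stage 2: score each generated aspect independently.
    pre_scores = {aspect: score_aspect(llm, text, aspect) for aspect in chain}
    # Stage 3: score the target aspect with the pre-scores placed in the prompt.
    context = "".join(
        f"The {aspect} score is {score:g} out of 5.\n"
        for aspect, score in pre_scores.items()
    )
    final = score_aspect(llm, text, target_aspect, context=context)
    return final, pre_scores


def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real model so the sketch runs without any API."""
    if prompt.startswith("List"):
        return "- fluency\n- relevance\n- consistency\n"
    # Return a different score when pre-scores are present in the prompt.
    return "4" if "score is" in prompt else "3"
```

Calling `coa_score(fake_llm, "The cat sat on the mat.", "coherence")` walks through all three stages. With a real model, `fake_llm` would be replaced by a function that sends the prompt to the model and returns its reply.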
