DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation
Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy F. Chen, Min-Yen Kan
TL;DR
DnA-Eval introduces a pedagogy-inspired, two-stage framework that decomposes open-ended text evaluation into contextually generated aspects and then aggregates per-aspect scores using model-generated weights and an external calculator. By making intermediate outputs (aspects and weights) explicit, the approach improves interpretability while achieving consistent performance gains across multiple meta-evaluation benchmarks and models, including both proprietary and open-source LLMs. The work demonstrates that externalized aggregation and task-adaptive aspect generation yield higher agreement with human judgments than direct scoring or Chain-of-Thought prompting, and it offers insights into how LLM evaluators weigh different criteria across domains. This framework advances reliable, transparent LLM-based evaluation with potential for broader tool-augmented, human-in-the-loop assessment of generated text.
Abstract
Rapid progress in Large Language Model (LLM) research has opened up new possibilities for evaluating generated text. LLMs serve as scalable and economical evaluators, but how reliable these evaluators are has emerged as a crucial research question. Prior work on the meta-evaluation of LLMs as judges prompts an LLM only once to obtain a final evaluation decision and then computes the agreement between the LLM's outputs and human labels. This single-step setup offers little interpretability into the evaluation capability of LLMs. To address this challenge, we propose Decompose and Aggregate, which breaks the evaluation process down into different stages based on pedagogical practices. Our experiments show that it not only provides a more interpretable window into how well LLMs evaluate, but also leads to improvements of up to 39.6% for different LLMs on a variety of meta-evaluation benchmarks.
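The aggregation stage summarized in the TL;DR (per-aspect scores combined with model-generated weights by an external calculator) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `aggregate`, the aspect names, the score scale, and the choice of a normalized weighted sum as the calculator. In the framework itself, the aspects, scores, and weights would come from LLM outputs; here they are hard-coded.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-aspect scores into one score with a weighted average.

    Hypothetical helper: weights are normalized by their sum, so they may be
    given as percentages (summing to 100) or as fractions (summing to 1).
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same aspects")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[aspect] * weights[aspect] for aspect in scores) / total


# Illustrative values only; in practice these would be parsed from LLM output.
weights = {"accuracy": 50, "relevance": 30, "clarity": 20}
response_a = {"accuracy": 4, "relevance": 5, "clarity": 3}
response_b = {"accuracy": 3, "relevance": 4, "clarity": 5}

score_a = aggregate(response_a, weights)
score_b = aggregate(response_b, weights)

# Pairwise decision: prefer the response with the higher aggregated score.
if score_a > score_b:
    preferred = "A"
elif score_b > score_a:
    preferred = "B"
else:
    preferred = "tie"
```

Doing the arithmetic outside the model keeps the intermediate aspects, scores, and weights inspectable, which is the interpretability benefit the TL;DR points to.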
