DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation

Minzhi Li; Zhengyuan Liu; Shumin Deng; Shafiq Joty; Nancy F. Chen; Min-Yen Kan

TL;DR

DnA-Eval introduces a pedagogy-inspired, two-stage framework that decomposes open-ended text evaluation into contextually generated aspects and then aggregates per-aspect scores using model-generated weights and an external calculator. By making intermediate outputs (aspects and weights) explicit, the approach improves interpretability while achieving consistent performance gains across multiple meta-evaluation benchmarks and models, including both proprietary and open-source LLMs. The work demonstrates that externalized aggregation and task-adaptive aspect generation yield higher agreement with human judgments than direct scoring or Chain-of-Thought prompting, and it offers insights into how LLM evaluators weigh different criteria across domains. This framework advances reliable, transparent LLM-based evaluation with potential for broader tool-augmented, human-in-the-loop assessment of generated text.
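
The externalized aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the aggregation is a normalized weighted sum of per-aspect scores, and the function name `aggregate`, the aspect names, the score scale, and the weights are all hypothetical.

```python
from typing import Dict, Tuple


def aggregate(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    weights: Dict[str, float],
) -> Tuple[float, float, str]:
    """Combine per-aspect scores for two responses into overall scores.

    Assumption: aggregation is a weighted sum normalized by the total
    weight. The arithmetic is done here, outside the LLM, which is the
    role an external calculator would play.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    if set(weights) != set(scores_a) or set(weights) != set(scores_b):
        raise ValueError("scores and weights must cover the same aspects")

    overall_a = sum(weights[k] * scores_a[k] for k in weights) / total
    overall_b = sum(weights[k] * scores_b[k] for k in weights) / total

    if overall_a > overall_b:
        verdict = "A"
    elif overall_b > overall_a:
        verdict = "B"
    else:
        verdict = "tie"
    return overall_a, overall_b, verdict


# Hypothetical aspects, weights (in percent), and scores on a 1-10 scale.
weights = {"accuracy": 50, "clarity": 30, "creativity": 20}
scores_a = {"accuracy": 8, "clarity": 6, "creativity": 5}
scores_b = {"accuracy": 6, "clarity": 8, "creativity": 9}

overall_a, overall_b, verdict = aggregate(scores_a, scores_b, weights)
print(overall_a, overall_b, verdict)
```

Keeping the aspects, scores, and weights as explicit inputs is what makes the intermediate outputs inspectable: each can be compared against human judgments before the final verdict is computed.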

Abstract

The rapid growth of Large Language Model (LLM) research has opened up new possibilities for evaluating generated text. LLMs serve as scalable and economical evaluators, but how reliable these evaluators are has emerged as a crucial research question. Prior work on the meta-evaluation of LLMs as judges prompts the LLM only once to obtain a final evaluation decision and then computes the agreement between the LLM's outputs and human labels. This single-step setup offers little interpretability into the evaluation capability of LLMs. To address this challenge, we propose Decompose and Aggregate, which breaks the evaluation process down into different stages based on pedagogical practices. Our experiments show that it not only provides a more interpretable window into how well LLMs evaluate, but also leads to improvements of up to 39.6% for different LLMs on a variety of meta-evaluation benchmarks.

Paper Structure (47 sections, 4 equations, 5 figures, 11 tables)

Introduction
Related Work
Automatic Text Evaluation.
LLM-based Text Evaluation.
Meta-Evaluation of LLMs as Evaluators.
DnA-Eval Framework
Aspect Generation
Pairwise Scoring by Aspect
Aggregation
Experiments
FairEval
MT-Bench
LLMBar
InstruSum
Experimental Setup
...and 32 more sections

Figures (5)

Figure 1: Unlike most previous work, which asks LLMs directly for their preference between two responses, our proposed DnA-Eval framework takes inspiration from key components of evaluation rubrics in pedagogy. It consists of criteria proposal, pairwise rating by aspect, and aggregation of aspect-wise scores. This framework enhances the transparency, accountability, and interpretability of the black-box evaluation process.
Figure 2: Different stages of DnA-Eval. In the decomposition stage, LLMs are provided with the context and asked to propose $k$ different evaluation aspects. These aspects are combined with the context and candidate responses so that LLMs can generate pairwise scores for each aspect. LLMs are also prompted to provide a weighting for each aspect given the context. In the aggregation stage, an external computing tool can be used to calculate the overall score for each response and compare the scores to decide on the better response.
Figure 3: Agreement with human annotators for a varying number of aspects. The baseline performance of direct prompting is shown as dashed lines. Our framework generally outperforms the baseline regardless of the number of aspects chosen.
Figure 4: Average model-generated weightings for writing and math tasks in the MT-Bench dataset. We report weightings for creativity and accuracy, which are task-dependent dimensions. The figure shows that all models assign lower weightings to creativity and higher weightings to accuracy for math problems than for writing tasks. This suggests that they are capable of generating weightings that are helpful for evaluation.
Figure 5: Kendall's $\tau$ distance for aspect weightings between different language models and humans. For comparison, the rank distance between two different human annotators is shown as dotted lines.
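
For readers unfamiliar with the metric in Figure 5, the sketch below computes a Kendall's $\tau$ distance between two rankings of the same aspects. It is an illustrative example only: how the paper converts weightings into ranks is not stated in this section, so the rank dictionaries, aspect names, and the function name `kendall_tau_distance` are assumptions.

```python
from itertools import combinations
from typing import Dict


def kendall_tau_distance(
    rank_x: Dict[str, int],
    rank_y: Dict[str, int],
    normalize: bool = True,
) -> float:
    """Count the aspect pairs that the two rankings order differently.

    A pair is discordant when one ranking places i above j and the other
    places j above i. Tied pairs are not counted as discordant here,
    which is a simplifying assumption of this sketch.
    """
    if set(rank_x) != set(rank_y):
        raise ValueError("both rankings must cover the same aspects")

    items = list(rank_x)
    discordant = 0
    for i, j in combinations(items, 2):
        if (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j]) < 0:
            discordant += 1

    n_pairs = len(items) * (len(items) - 1) // 2
    if not normalize or n_pairs == 0:
        return float(discordant)
    # Normalized distance: 0.0 for identical order, 1.0 for reversed order.
    return discordant / n_pairs


# Hypothetical ranks (1 = most heavily weighted aspect).
model_rank = {"accuracy": 1, "clarity": 2, "creativity": 3, "depth": 4}
human_rank = {"accuracy": 1, "clarity": 3, "creativity": 2, "depth": 4}

print(kendall_tau_distance(model_rank, human_rank))
```

A smaller distance means the model orders the aspects more like the human annotator does; the distance between two human annotators gives a reference point for how much disagreement is expected.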
