HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

Yuxuan Liu; Tianchi Yang; Shaohan Huang; Zihan Zhang; Haizhen Huang; Furu Wei; Weiwei Deng; Feng Sun; Qi Zhang

TL;DR

HD-Eval introduces a hierarchical criteria decomposition framework to align LLM-based evaluators with human preferences. By iteratively decomposing evaluation tasks, learning a human-preference-aware white-box aggregator, and performing attribution-based pruning, it produces multi-level, explainable evaluation criteria that better capture natural language qualities. Across summarization, conversation, and data-to-text tasks, HD-Eval yields stronger alignment with human judgments than strong baselines, and remains effective under limited human data. The approach enhances transparency and efficiency in evaluating NLG systems and offers practical insights into human evaluation priorities.

Abstract

Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts and criteria. To address this challenge, we propose HD-Eval, a novel framework that iteratively aligns LLM-based evaluators with human preference via Hierarchical Criteria Decomposition. HD-Eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators by decomposing a given evaluation task into finer-grained criteria, aggregating them according to estimated human preferences, pruning insignificant criteria with attribution, and further decomposing significant criteria. By integrating these steps within an iterative alignment training process, we obtain a hierarchical decomposition of criteria that comprehensively captures aspects of natural language at multiple levels of granularity. Implemented as a white box, the human preference-guided aggregator is efficient to train and more explainable than relying solely on prompting, and its independence from model parameters makes it applicable to closed-source LLMs. Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators and providing deeper insights into the explanation of evaluation results and the task itself.
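The decompose, aggregate, prune, and expand loop described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract says only that the aggregator is a white box and that pruning uses attribution, so the least-squares linear aggregator and the |weight| × standard-deviation attribution score are stand-ins chosen for brevity. `score_fn` and `decompose_fn` are hypothetical placeholders for the LLM calls that score a sample on a criterion and split a criterion into sub-criteria, and the toy data is synthetic.

```python
import numpy as np

def fit_aggregator(scores, human):
    """Assumed white-box aggregator: least-squares linear fit with a bias term."""
    X = np.hstack([scores, np.ones((len(scores), 1))])
    coef, *_ = np.linalg.lstsq(X, human, rcond=None)
    return coef[:-1], coef[-1]

def attribute(weights, scores):
    """Assumed attribution score: |weight| times the spread of each criterion."""
    return np.abs(weights) * scores.std(axis=0)

def iterative_alignment(samples, human, root, score_fn, decompose_fn,
                        max_layers=2, keep=1):
    criteria, newest = list(root), list(root)
    for layer in range(max_layers):
        # Score every sample on every criterion collected so far.
        scores = np.array([[score_fn(s, c) for c in criteria] for s in samples],
                          dtype=float)
        weights, bias = fit_aggregator(scores, human)
        if layer == max_layers - 1:
            break  # maximum layer count reached: finalize the aggregator
        # Keep the most significant criteria of the newest layer, expand them.
        attr = dict(zip(criteria, attribute(weights, scores)))
        selected = sorted(newest, key=attr.get, reverse=True)[:keep]
        newest = [child for c in selected for child in decompose_fn(c)]
        criteria += newest
    return criteria, weights, bias

# Toy usage with synthetic data; the names below are hypothetical.
names = ["fluency", "relevance", "relevance/on-topic", "relevance/specific"]
children = {"relevance": names[2:], "fluency": ["fluency/grammar"]}
rng = np.random.default_rng(0)
samples = [dict(zip(names, rng.uniform(1, 5, size=4))) for _ in range(20)]
# Synthetic "human" ratings that depend mostly on relevance.
human = np.array([0.1 * s["fluency"] + 3.0 * s["relevance"] for s in samples])

criteria, weights, bias = iterative_alignment(
    samples, human, root=["fluency", "relevance"],
    score_fn=lambda s, c: s.get(c, 0.0),
    decompose_fn=lambda c: children.get(c, []),
)
print(criteria)
```

In this toy run the synthetic ratings depend mostly on relevance, so only that criterion is expanded at the second layer and the fluency sub-criterion is never generated. With a real evaluator, each `score_fn` call would be an LLM query, which is why pruning before expansion also limits cost.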

Paper Structure (40 sections, 3 equations, 13 figures, 12 tables)

Introduction
Methodology
Hierarchical Criteria Decomposition
Criteria Decomposition with LLMs
Hierarchy-Aware Prompting
Human Preference-Guided Aggregation
Attribution Pruning
Iterative Alignment Training Framework
Experiments
Experimental Setup
Datasets and Evaluations
Baselines
Models and Configurations
Experimental Results
Human Alignment
...and 25 more sections

Figures (13)

Figure 1: Overall framework of HD-Eval. Starting from the evaluation task, HD-Eval iteratively decomposes it into different aspects, trains an aggregator, then selects significant criteria with attribution pruning for further expansion at the next layer. The aggregator and decomposition are finalized after reaching the maximum layer count.
Figure 2: Illustration of hierarchical criteria decomposition and iterative alignment training of HD-Eval. A formal description of the iterative alignment training procedure of HD-Eval is given in Algorithm \ref{alg:mainalgo}.
Figure 3: A case study of criteria decomposition on Topical-Chat. White, blue, and orange boxes denote decomposed criteria at hierarchy levels 1, 2, and 3. Underlined text denotes criteria selected by attribution pruning.
Figure 4: Performance of HD-Eval under different training data counts on Topical-Chat, averaged over 5 seeds.
Figure 5: Criteria efficiency of HD-Eval on Topical-Chat. Results are averaged over 5 random samples.
...and 8 more figures
