DHP Benchmark: Are LLMs Good NLG Evaluators?

Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu

TL;DR

The paper introduces the Discernment of Hierarchical Perturbation (DHP) benchmark to assess LLMs as NLG evaluators by quantifying their ability to detect targeted quality issues through hierarchical perturbations. It re-establishes six evaluation datasets across four NLG tasks and benchmarks five LLM series, using Wilcoxon tests and harmonic mean p-values with expert weights to derive Discernment Scores that are less biased by response styles. Findings indicate that most LLMs can discern quality issues across tasks, with GPT-4 Turbo often the most stable and capable evaluator, while smaller models and certain translation tasks reveal limitations; the framework also highlights task- and dataset-specific strengths and weaknesses. Overall, DHP offers a rigorous, automated, and task-aware approach to evaluating LLM-based evaluators, complementing human judgments and traditional metrics in NLG assessment.

Abstract

Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks; this is often referred to as the "LLM-as-a-judge" paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. This framework leverages hierarchically perturbed text data and statistical tests to systematically measure the NLG evaluation capabilities of LLMs. We re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM families provides critical insight into their strengths and limitations as NLG evaluators. Our dataset is available at https://huggingface.co/datasets/YCWANGVINCE/DHP_Benchmark.
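
The statistical step named in the TL;DR (Wilcoxon tests followed by harmonic mean p-values with expert weights) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the example scores, the weights, and the choice of a one-sided signed-rank test with a normal approximation are all assumptions made for illustration. The final mapping from the combined p-value to a Discernment Score is not specified in this summary, so the sketch stops at the combined p-value.

```python
import math


def wilcoxon_greater_p(original, perturbed):
    """One-sided Wilcoxon signed-rank p-value (H1: original > perturbed).

    Assumption: normal approximation without tie or continuity correction,
    which is a simplification chosen to keep the sketch dependency-free.
    """
    diffs = [o - p for o, p in zip(original, perturbed) if o != p]
    n = len(diffs)
    if n == 0:
        return 1.0  # no non-zero differences, so no evidence either way
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def weighted_harmonic_mean_p(p_values, weights, floor=1e-300):
    """Combine p-values as sum(w) / sum(w / p); `floor` avoids division by zero."""
    total = sum(weights)
    return total / sum(w / max(p, floor) for w, p in zip(weights, p_values))


# Hypothetical evaluator scores for the same texts before and after perturbation.
original_scores = [5, 4, 5, 4, 5, 4, 5, 4]
perturbed_scores = [3, 2, 3, 3, 2, 2, 3, 2]
p_single = wilcoxon_greater_p(original_scores, perturbed_scores)

# Hypothetical p-values for two perturbation types, with made-up expert weights.
p_combined = weighted_harmonic_mean_p([p_single, 0.04], [0.7, 0.3])
print(p_single, p_combined)
```

A small combined p-value would suggest that the evaluator consistently scores perturbed text lower than the original, which is the behavior the benchmark is meant to detect. For real use, a tested routine such as `scipy.stats.wilcoxon` would be preferable to the hand-written approximation above.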

Paper Structure (24 sections, 5 equations, 6 figures, 3 tables)

This paper contains 24 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Challenges in Assessing LLMs as NLG Evaluators: Biased Response Styles and Multiple Evaluation Metrics. Our DHP Framework employs hierarchical perturbation and statistical tests to address these challenges, offering quantitative discernment scores for effective comparison.
  • Figure 2: Response styles of five LLMs evaluated using the SummEval dataset (Fabbri et al., 2021).
  • Figure 3: The DHP framework for each NLG task. It includes three steps: (1) Hierarchical Perturbation, (2) LLM Evaluation, and (3) Statistical Analysis. This figure demonstrates the framework with four perturbation types ($P=4$) and three evaluation metrics ($M=3$).
  • Figure 4: The DHP benchmarking results across four NLG tasks. Notably, in (d) for the Question Answering task, $D$ and $D^{EW}$ are identical because this task utilizes only one evaluation metric. The red lines on the charts represent $D$ or $D^{EW}=1$, which indicates the threshold for statistical significance in discernment scores.
  • Figure 5: User interface of the expert weight survey conducted to determine the impact of various quality issues on NLG task metrics.
  • ...and 1 more figure