DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu
TL;DR
The paper introduces the Discernment of Hierarchical Perturbation (DHP) benchmark to assess LLMs as NLG evaluators by quantifying their ability to detect targeted quality issues through hierarchical perturbations. It re-establishes six evaluation datasets across four NLG tasks and benchmarks five LLM series, using Wilcoxon tests and harmonic mean p-values with expert weights to derive Discernment Scores that are less biased by response styles. Findings indicate that most LLMs can discern quality issues across tasks, with GPT-4 Turbo often the most stable and capable evaluator, while smaller models and certain translation tasks reveal limitations; the framework also highlights task- and dataset-specific strengths and weaknesses. Overall, DHP offers a rigorous, automated, and task-aware approach to evaluating LLM-based evaluators, complementing human judgments and traditional metrics in NLG assessment.
Abstract
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks; this is often referred to as the "LLM-as-a-judge" paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. This framework leverages hierarchically perturbed text data and statistical tests to systematically measure the NLG evaluation capabilities of LLMs. We re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM families provides critical insight into their strengths and limitations as NLG evaluators. Our dataset is available at https://huggingface.co/datasets/YCWANGVINCE/DHP_Benchmark.
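The TL;DR mentions combining per-metric test results through harmonic mean p-values with expert weights. As a rough illustration of that single step, the sketch below computes a weighted harmonic mean of p-values in plain Python. The function name, the example p-values, and the weights are hypothetical and are not taken from the paper. The sketch does not run the Wilcoxon tests, and it does not reproduce the paper's mapping from combined p-values to Discernment Scores.

```python
from typing import Sequence


def weighted_harmonic_mean_p(p_values: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Combine p-values as sum(w) / sum(w_i / p_i).

    Hypothetical helper for illustration only; the weights stand in for
    the "expert weights" mentioned in the TL;DR and need not sum to 1.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("each p-value must lie in (0, 1]")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Dividing by the weight total makes the result invariant to weight scale.
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


# Made-up p-values for three evaluation metrics, with made-up weights.
p_values = [0.001, 0.02, 0.30]
weights = [0.5, 0.3, 0.2]
combined = weighted_harmonic_mean_p(p_values, weights)
print(f"combined p-value: {combined:.6f}")
```

The harmonic mean is dominated by the smallest p-values, so one metric on which the evaluator clearly separates original from perturbed text pulls the combined value down sharply; the weights control how much each metric contributes to that effect.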
