Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

Jie Ruan, Xiao Pu, Mingqi Gao, Xiaojun Wan, Yuesheng Zhu

TL;DR

This work tackles unreliable inter-system rankings in NLG caused by sampling variability in human evaluation. It introduces the Constrained Active Sampling Framework (CASF), a multi-phase, constraint-driven approach that uses a Learner to predict sample quality, a Systematic Sampler to create balanced buckets, and a Constrained Controller to minimize redundancy and preserve representativeness. Across 16 datasets, 5 NLG tasks, and 44 human evaluation metrics, CASF achieves an overall inter-system ranking Kendall correlation of $0.83$ and a top-ranked system identification accuracy of $93.18\%$, substantially outperforming Random and Heuristic baselines. The method enables more reliable gold-standard human judgments at reduced cost, and the authors release code and data to facilitate adoption in practice.
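To make the sampling-variability problem concrete, the sketch below scores three hypothetical systems on four samples and compares the ranking obtained from a subset with the ranking on the full set, using Kendall's tau. The ratings, the system names, and the helper names `kendall_tau` and `system_means` are invented for illustration and are not taken from the paper. The tau computed here is the simple tau-a variant, which may differ from the variant the authors report.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists of equal length (ties count as zero)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs


def system_means(ratings: Dict[str, List[float]], subset: Sequence[int]) -> List[float]:
    """Mean human rating of each system over the sample indices in `subset`."""
    return [sum(r[i] for i in subset) / len(subset) for r in ratings.values()]


# Toy human ratings (hypothetical): one list per system, one entry per sample.
ratings = {
    "SysA": [5, 2, 4, 1],
    "SysB": [3, 3, 2, 3],
    "SysC": [1, 4, 2, 1],
}

full = system_means(ratings, range(4))              # ranking on the entire dataset
tau_sub1 = kendall_tau(full, system_means(ratings, [0, 2]))
tau_sub2 = kendall_tau(full, system_means(ratings, [1, 3]))
print(tau_sub1, tau_sub2)
```

With these toy numbers, the subset `[0, 2]` reproduces the full ranking (tau of 1.0), while the subset `[1, 3]` puts SysB ahead of SysA and yields a negative tau. That is the kind of disagreement a better sampling strategy is meant to reduce.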

Abstract

Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers in practice usually perform human evaluation on a small subset of data sampled from the whole dataset. However, different sampled subsets lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold-standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples that yield a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.
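The abstract names the three components but does not give their details here, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that a learner has already produced one predicted quality score per sample, that the sampler splits the score-sorted samples into equal-sized buckets, and that the controller is a caller-supplied redundancy check. The function name `systematic_constrained_sample` and the `is_redundant` callback are hypothetical.

```python
from typing import Callable, List, Sequence


def systematic_constrained_sample(
    scores: Sequence[float],
    n_select: int,
    is_redundant: Callable[[int, List[int]], bool],
) -> List[int]:
    """Pick one sample index per quality bucket, skipping redundant candidates.

    scores       -- predicted quality per sample (stand-in for a learner's output)
    n_select     -- number of samples to send to human evaluation
    is_redundant -- assumed constraint: True if a candidate duplicates the selection
    """
    n = len(scores)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and len(scores)")

    # Sort sample indices from lowest to highest predicted quality.
    order = sorted(range(n), key=lambda i: scores[i])

    selected: List[int] = []
    for b in range(n_select):
        # Equal-sized, contiguous bucket over the sorted order.
        bucket = order[b * n // n_select:(b + 1) * n // n_select]
        mid = len(bucket) // 2
        # Try the middle of the bucket first, then move outward.
        positions = sorted(range(len(bucket)), key=lambda p: abs(p - mid))
        choice = bucket[mid]  # fallback if every candidate is flagged redundant
        for p in positions:
            if not is_redundant(bucket[p], selected):
                choice = bucket[p]
                break
        selected.append(choice)
    return selected


predicted = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
picked = systematic_constrained_sample(predicted, 3, lambda i, sel: False)
print(picked)
```

Taking one sample per bucket keeps the selection spread over low, medium, and high predicted quality, and the redundancy callback lets a constraint veto a candidate in favor of a neighbor from the same bucket.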

Paper Structure (39 sections, 4 equations, 6 figures, 14 tables)


Figures (6)

  • Figure 1: Conducting human evaluations on different sample subsets (Sub) can obtain different inter-system rankings. The lower part shows the same sampling method obtains different subsets at different sampling times. The upper part shows the ranking obtained from the corresponding subsets. "Sys" represents system and "GT" represents Ground Truth.
  • Figure 2: Constrained Active Sampling Framework
  • Figure 3: Example of systematic sampler and constrained controller cooperating to select final samples
  • Figure 4: Inter-system ranking of human evaluation aspect 'accuracy' of OpenAI 1. "GT" is the inter-system ranking on the entire dataset. Sampling rate is 50%. "Sys" represents system. Rankings in red indicate incorrect rankings.
  • Figure 5: The number of papers at top NLP conferences that use random sampling for human evaluation or do not mention the sampling method they used (unknown sampling methods).
  • ...and 1 more figure