QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Weiping Fu, Bifan Wei, Jianxiang Hu, Zhongmin Cai, Jun Liu
TL;DR
QGEval addresses the lack of standardized human evaluation in question generation (QG) with a multi-dimensional benchmark that rates generated questions on seven dimensions: fluency, clarity, conciseness, relevance, consistency, answerability, and answer consistency. It builds a 3,000-question dataset derived from 200 passages using 15 QG models (including GPT-4 variants) on SQuAD and HotpotQA, following a two-stage generation-and-annotation pipeline. The study finds that most models struggle with answerability and answer consistency, and that many automatic metrics align poorly with human judgments. Multi-dimensional metrics (especially reference-free ones) and LLM-based evaluators show promise but do not fully replace human evaluation. By providing both the dataset and accompanying evaluation tools, QGEval aims to support more reliable comparisons of QG models and the development of automatic metrics that align better with human judgments. The work also points to a need for richer evaluation criteria, and possibly more discriminative dimensions, to identify high-quality questions that are useful in practice.
Abstract
Automatically generated questions often suffer from problems such as unclear expression or factual inaccuracies, requiring a reliable and comprehensive evaluation of their quality. Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics. However, there is a lack of unified human evaluation criteria, which hampers consistent and reliable evaluations of both QG models and automatic metrics. To address this, we propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions: fluency, clarity, conciseness, relevance, consistency, answerability, and answer consistency. We demonstrate the appropriateness of these dimensions by examining their correlations and distinctions. Through consistent evaluations of QG models and automatic metrics with QGEval, we find that 1) most QG models perform unsatisfactorily in terms of answerability and answer consistency, and 2) existing metrics fail to align well with human judgments when evaluating generated questions across the 7 dimensions. We expect this work to foster the development of both QG technologies and their evaluation.
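As a rough illustration of what "aligning with human judgments" can mean in practice, the sketch below correlates one automatic metric's scores with human ratings on each dimension. The choice of Pearson correlation, the function names `pearson` and `metric_alignment`, the underscore spelling `answer_consistency`, and the toy numbers are all assumptions made for this example; none of them is taken from the QGEval release.

```python
from math import sqrt

# The seven dimensions named in the abstract (identifier spelling is an assumption).
DIMENSIONS = [
    "fluency", "clarity", "conciseness", "relevance",
    "consistency", "answerability", "answer_consistency",
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)


def metric_alignment(human_scores, metric_scores):
    """Correlate one metric with human ratings on every dimension that is present.

    human_scores: dict mapping a dimension name to a list of human ratings,
                  one rating per generated question.
    metric_scores: list of the metric's scores for the same questions.
    """
    return {
        dim: pearson(human_scores[dim], metric_scores)
        for dim in DIMENSIONS
        if dim in human_scores
    }


# Made-up ratings for five questions on two dimensions (illustration only).
toy_human = {
    "fluency": [3, 3, 2, 3, 1],
    "answerability": [1, 3, 2, 3, 1],
}
toy_metric = [0.42, 0.81, 0.55, 0.77, 0.30]

for dim, r in metric_alignment(toy_human, toy_metric).items():
    print(f"{dim}: r = {r:.3f}")
```

Under this reading, a metric that tracks a dimension well would show a correlation close to 1 on that dimension, while values near 0 would correspond to the weak alignment the abstract reports.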
