Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
Yoo Yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber
TL;DR
AdvScore introduces a human-grounded metric based on item response theory (IRT) that quantifies adversarialness and discriminability in NLP benchmarks, addressing the way the gap between human and model performance shifts as models evolve. Built on the two-parameter logistic IRT model (2PL-IRT), AdvScore computes a margin that captures human–model gaps, discounts ambiguity, and uses Fisher information to quantify discriminability, yielding a dataset-level score that reflects both adversarialness and informativeness. The authors demonstrate AdvScore on AdvQA, a new QA dataset built with crowdsourcing and human-in-the-loop (HITL) authoring, showing that AdvQA maintains stronger adversarial signals over time than baseline datasets and that AdvScore provides a more nuanced assessment than QSR alone. The AdvQA creation pipeline combines adversarial writing, real-time model feedback, and rigorous human evaluation to produce high-quality, realistic adversarial questions for robust model evaluation and future dataset design.
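To make the 2PL-IRT ingredients named above concrete, the following is a minimal sketch of the standard 2PL response probability and its Fisher information. It is not the authors' implementation: the function names, the parameter names (`theta`, `a`, `b`), and the `human_model_gap` helper are assumptions introduced here for illustration, and the actual AdvScore margin and ambiguity discount are defined in the paper, not reproduced here.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL IRT: probability that a subject with skill `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at skill `theta`: a^2 * p * (1 - p).
    It peaks where theta == b, i.e. where the item separates subjects best."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def human_model_gap(theta_human: float, theta_model: float,
                    a: float, b: float) -> float:
    """Hypothetical illustration only: difference between the human and model
    success probabilities on one item. This is NOT the paper's margin formula."""
    return p_correct(theta_human, a, b) - p_correct(theta_model, a, b)
```

Under these assumptions, an item is adversarial in the intended sense when the gap is positive (humans are more likely than models to answer correctly), and it is informative when its Fisher information is high in the skill range of interest.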
Abstract
Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose AdvScore, a human-grounded evaluation metric that assesses a dataset's adversarialness by capturing models' and humans' varying abilities while also identifying poor examples. We then use AdvScore to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language models' predictions to track model improvement over five years, from 2020 to 2024. AdvScore thus provides guidance for achieving robustness comparable with human capabilities. Furthermore, it helps determine to what extent adversarial datasets continue to pose challenges, ensuring that, rather than reflecting outdated or overly artificial difficulties, they effectively test model capabilities.
