Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

Yoo Yeon Sung; Maharshi Gor; Eve Fleisig; Ishani Mondal; Jordan Lee Boyd-Graber

TL;DR

AdvScore introduces a human-grounded metric, based on item response theory (IRT), to quantify adversarialness and discriminability in NLP benchmarks, addressing the mismatch between human and model performance as models evolve. Built on 2PL-IRT, AdvScore computes a margin capturing human–model gaps, discounts ambiguity, and uses Fisher information to quantify discriminability, yielding a dataset-level score that reflects both adversarialness and informativeness. The authors demonstrate AdvScore on AdvQA, a new QA dataset built with crowdsourcing and human-in-the-loop (HITL) authoring, showing that AdvQA maintains stronger adversarial signals over time than baselines and that AdvScore gives a more nuanced assessment than question success rate (QSR) alone. The AdvQA creation pipeline combines adversarial writing, real-time model feedback, and rigorous human evaluation to produce high-quality, realistic adversarial questions useful for robust model evaluation and future dataset design.

Abstract

Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose AdvScore, a human-grounded evaluation metric that assesses a dataset's adversarialness by capturing models' and humans' varying abilities while also identifying poor examples. We then use AdvScore to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language models' predictions to track model improvement over five years, from 2020 to 2024. AdvScore thus provides guidance for achieving robustness comparable with human capabilities. Furthermore, it helps determine to what extent adversarial datasets continue to pose challenges, ensuring that, rather than reflecting outdated or overly artificial difficulties, they effectively test model capabilities.

Paper Structure (50 sections, 8 equations, 5 figures, 10 tables)

Introduction: Evaluating Adversarial Datasets Requires Human Answers
Preliminaries of AdvScore: IRT
2PL-IRT
Advantages of IRT over question success rate
AdvScore
Quantifying Adversarialness
Skilled Groups.
Margin Computation.
Accounting for Question Ambiguity.
Measuring Discriminability
Combining into AdvScore
Adversarial Benchmark Evaluation
Adversarial datasets with human responses.
Comparison of adversarial benchmarks.
Chronological evaluation of adversarialness
...and 35 more sections

Figures (5)

Figure 1: AdvScore diagnoses whether a question is adversarial (top) or difficult for computers to answer for other reasons (bottom). After collecting candidate questions, we ask humans and computers to answer the questions. The top question (from AdvQA) has a higher AdvScore because it is specific, adversarial, discriminative, high-quality, and realistic. In contrast, the bottom question is ambiguous (e.g., neither humans nor models answered it correctly), which is confirmed by its low AdvScore.
Figure 2: Visualization of key AdvScore components across datasets. For each dataset, we plot: (1) skill density of skilled humans ($H_{(0)}$) and skilled models ($M_{(0)}$); (2) response correctness probability, $\sigma_{\text{2pl}}(\theta)$ (Eq. \ref{eq:2pl}, § \ref{sec:irt}), averaged over dataset examples; and (3) the item information function, $\textsc{iif}(\theta)$ (Eq. \ref{eq:iif}, § \ref{subsec:disc}). Vertical dashed lines show representative (average) skill levels for humans and models. The gap between human and model probabilities (shaded region between the horizontal lines) indicates adversarialness ($\mu_D$). IIF peaks show where questions are most informative, and the area under the curve signals total informativeness (discriminability, $\kappa_D$). Key insights: Bamboogle has high informativeness but favors models (negative $\mu_D$). TrickMe separates humans and models (positive $\mu_D$) but has lower discriminability. AdvQA performs best overall, effectively discriminating between humans and models while maintaining high informativeness throughout, resulting in the highest AdvScore of 0.31.
Figure 3: We report AdvScore for each dataset over the years, confirming that AdvQA holds the highest AdvScore with the smallest decline over the last five years, proving its adversarial robustness.
Figure 4: The overall distribution of LR coefficients suggests that lifestyle and commonsense knowledge contribute more to adversarialness than other features. This implies that models still struggle with commonsense knowledge, highlighting an area where they remain vulnerable compared to human understanding.
Figure 5: As the target answer to the question should be "Apple Inc," the interface is updated with answers from retrieval models with the most relevant sentence and from LMs (e.g., DistilBERT, T5). Also, the highlights are updated by the input perturbation technique.
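
The quantities plotted in Figure 2 can be illustrated with a minimal Python sketch of the textbook 2PL-IRT model. This is not the authors' implementation: the function names (`p_correct`, `item_information`, `margin`) and the parameter names (`a` for item discriminability, `b` for item difficulty) are assumptions made for illustration, and the paper's own equations may differ, for example in how ambiguity is discounted and how per-item values are aggregated into $\mu_D$ and $\kappa_D$.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Textbook 2PL response probability for a subject with skill `theta`.

    `a` (discriminability) and `b` (difficulty) are hypothetical parameter
    names; the paper's notation may differ.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at skill `theta`: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def margin(theta_human: float, theta_model: float, a: float, b: float) -> float:
    """Per-item gap between human and model correctness probabilities.

    Positive values mean the item favors humans. This is a simplified
    stand-in for the human-model gap described in the Figure 2 caption.
    """
    return p_correct(theta_human, a, b) - p_correct(theta_model, a, b)
```

Under these assumptions, an item whose difficulty sits between a representative human skill and a representative model skill yields a positive margin, and its information peaks at `theta == b`, where the response probability is 0.5.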
