Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

Ashim Gupta; Rishanth Rajendhran; Nathan Stringham; Vivek Srikumar; Ana Marasović


TL;DR

The paper critically assesses NLP robustness beyond model scale, using more than 20 models across architectures and sizes and evaluating through OOD splits, behavioral CheckLists, contrast sets, and adversarial inputs. It finds that simply increasing model size does not guarantee robustness, with many OOD splits becoming less informative and CheckList gaps persisting even for high-accuracy models; it also reveals fragility in adversarial evaluation methods and proposes more rigorous, defense-aware metrics. The work advocates broader, more nuanced evaluation paradigms that transcend IID benchmarks, including higher-quality data, contrastive testing, and robust adversarial protocols, to better characterize practical robustness. Collectively, it emphasizes that improving NLP robustness requires rethinking evaluation practices as much as model design, since current methods may misrepresent real-world resilience. The findings highlight the need for continued, multi-faceted robustness research, including stress-testing under distribution shifts and more reliable adversarial assessment, to drive truly robust NLP systems.

Abstract

Do larger and more performant models resolve NLP's longstanding robustness issues? We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.


Paper Structure (38 sections, 5 equations, 17 figures, 36 tables)


Introduction
Rethinking Common OOD Splits
Common OOD/Challenge Data Splits.
Models.
Q1: Do commonly used OOD splits remain a valid choice for investigating OOD robustness?
Q2: Are challenge sets still "stressful"?
Highly Accurate Models Still Stumble On The Basics
Background.
Q1: Are we at a stage where accurate models meet the expectations for their capabilities?
Q2: Are larger models more capable?
Better Evaluation Paradigms Exist
Background.
Experimental Setup.
Q1: Is there still a gap between original test sets and their contrastive counterparts?
Q2: Has consistency improved despite gaps?
...and 23 more sections

Figures (17)

Figure 1: Finetuning and evaluation datasets determined by analyzing train-test splits in *ACL/EMNLP publications from 2020--2022 (§\ref{sec:setup}). Individual train-test splits are reported in Table \ref{tab:train_eval2}. They represent the most common data setups for studying two popular aspects of NLP robustness.
Figure 2: In-domain vs. OOD accuracy of 19 models finetuned for 4 types of classification tasks across 9 training datasets and 1--14 OOD datasets per training set. The dashed line marks where OOD accuracy equals in-domain accuracy; the dotted line marks where it is at most 3% lower. min(#P) is the number of parameters of the smallest model that achieves the latter. The gray points below the dotted line are linked with data splits that do not appear in the upper right region.
Figure 3: The task performance on the standard test set vs. CheckList performance.
Figure 4: Flan-T5-11B performance with standard measures (accuracy, F1, token-F1) vs. contrast set consistency. The model's instruction finetuning data includes training data for all tasks except CondaQA. Prompts include an instruction, 8 examples, and optionally explanations for chain-of-thought prompting and self-consistency decoding.
Figure 5: The change in the attack success rate (ASR) as measured in prior work (\ref{eq:asr_prev}) vs. our robust modification (\ref{eq:asr_robust}). TextFooler is used to train the defense and DeepWordBug to fool 19 finetuned models in Table \ref{tab:models}.
...and 12 more figures
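
Figure 4 compares standard measures with contrast set consistency. The sketch below illustrates why the two can diverge. It assumes the usual definition from the contrast-set literature: a set counts as consistent only if every example in it, the original included, is predicted correctly. The tuple layout, the function names, and the toy data are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (contrast_set_id, gold_label, predicted_label).
Example = Tuple[str, str, str]


def accuracy(examples: List[Example]) -> float:
    """Fraction of individual examples predicted correctly (non-empty input assumed)."""
    return sum(gold == pred for _, gold, pred in examples) / len(examples)


def contrast_consistency(examples: List[Example]) -> float:
    """Fraction of contrast sets in which every example is predicted correctly."""
    all_correct: Dict[str, bool] = {}
    for set_id, gold, pred in examples:
        # A single wrong prediction marks the whole set as inconsistent.
        all_correct[set_id] = all_correct.get(set_id, True) and (gold == pred)
    return sum(all_correct.values()) / len(all_correct)


# Toy data: two contrast sets, each with an original and one perturbed example.
examples = [
    ("s1", "pos", "pos"), ("s1", "neg", "neg"),  # both correct
    ("s2", "pos", "pos"), ("s2", "neg", "pos"),  # perturbed example is wrong
]

print(accuracy(examples))              # 0.75
print(contrast_consistency(examples))  # 0.5
```

Because one error invalidates an entire set, consistency can never exceed the fraction of examples that are correct when all sets have the same size, so it tends to sit well below per-example accuracy.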
