A Survey on Out-of-Distribution Evaluation of Neural NLP Models

Xinzhe Li; Ming Liu; Shang Gao; Wray Buntine

TL;DR

Neural NLP models often falter under out-of-distribution (OOD) conditions. This survey unifies three research lines (adversarial robustness, domain generalization, and dataset biases) under the umbrella of distribution shift. It clarifies the data-generating processes behind natural domain shift (NDS) data, debiased data, and adversarially perturbed data, and the two evaluation paradigms (data-based and method-based), linking each to covariate-shift concepts and shifted features. The authors offer a framework that connects the three OOD lines, highlights the roles of semantic, background, and biased features, and proposes opportunities such as a comprehensive benchmarking suite and cross-line detection approaches. They also discuss challenges, including covariate-shift assumptions, the realism of adversarial attacks, and the need for OOD performance that generalizes across diverse NLP tasks.

Abstract

Adversarial robustness, domain generalization, and dataset biases are three active lines of research contributing to out-of-distribution (OOD) evaluation of neural NLP models. However, a comprehensive, integrated discussion of the three research lines is still lacking in the literature. In this survey, we 1) compare the three lines of research under a unifying definition; 2) summarize the data-generating processes and evaluation protocols for each line of research; and 3) emphasize the challenges and opportunities for future work.

Paper Structure (43 sections, 2 equations, 4 tables)

This paper contains 43 sections, 2 equations, 4 tables.

Introduction
Definition
Distribution Shift
Domain generalization and dataset biases.
Adversarial robustness: from robustness to distribution shift.
Shifted Features
Background features and semantic features.
Biased features are task-irrelevant features under $\mathbb{P}_\text{true}$ despite being learned as task-relevant features from $\mathbb{P}_0$.
Shifted Features in Three OOD Types
Shifted features on adversarial examples.
Shifted features on debiased data.
Shifted features on NDS data.
OOD Performance Evaluation
NDS Data Generation
Genres.
...and 28 more sections
