Principles from Clinical Research for NLP Model Generalization

Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor

TL;DR

This paper argues that NLP generalization cannot be reduced to out-of-distribution concerns alone: models can rely on spurious surface patterns, which motivates a framework inspired by clinical research. In a relation extraction case study using surrogate explainable models, the authors show that high test-set performance can mask dependence on distance-based surface cues, and that the correlation of these cues with ground-truth labels and with BioBERT predictions varies across data splits and generalization sets. They import concepts from clinical research (population definitions, internal validity, external validity, and randomized or matched comparisons) and propose adapting matched-pair evaluation and perturbation-based control sets to NLP. The work argues that robust generalization assessment requires careful test design, replication, and attention to LLM-specific sensitivities, and it offers practical guidance for reliable evaluation in real-world NLP deployments.

Abstract

The NLP community typically relies on the performance of a model on a held-out test set to assess generalization. Performance drops observed on datasets outside of official test sets are generally attributed to "out-of-distribution" effects. Here, we explore the foundations of generalizability and study the factors that affect it, articulating lessons from clinical studies. In clinical research, generalizability is an act of reasoning that depends on (a) internal validity of experiments to ensure controlled measurement of cause and effect, and (b) external validity or transportability of the results to the wider population. We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can affect a model's internal validity and in turn adversely impact generalization. We therefore present the need to ensure internal validity when building machine learning models in NLP. Our recommendations also apply to generative large language models, as they are known to be sensitive to even minor semantics-preserving alterations. We also propose adapting the idea of matching in randomized controlled trials and observational studies to NLP evaluation to measure causation.
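
The matching idea in the abstract can be illustrated with a small sketch. Each test example is paired with a "twin" that keeps the entities and the label but changes one surface property, here the distance between the two entities, and the evaluation counts how often the prediction flips within a pair. Everything below is hypothetical and chosen only for illustration: the `Example` type, the filler phrase (assumed to be semantics-preserving), the two toy models, and the three sentences. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    tokens: Tuple[str, ...]
    e1: int      # token index of the first entity
    e2: int      # token index of the second entity (assumed != e1)
    label: int   # 1 = relation holds, 0 = no relation


def insert_filler(ex: Example,
                  filler: Tuple[str, ...] = ("as", "previously", "reported", ",")) -> Example:
    """Build a matched twin: same entities and label, larger entity distance.

    The filler is inserted directly before the later entity. That this keeps
    the meaning intact is an assumption of the sketch, not a guarantee.
    """
    later = max(ex.e1, ex.e2)
    tokens = ex.tokens[:later] + filler + ex.tokens[later:]
    shift = len(filler)
    e1 = ex.e1 + shift if ex.e1 == later else ex.e1
    e2 = ex.e2 + shift if ex.e2 == later else ex.e2
    return Example(tokens, e1, e2, ex.label)


def matched_pair_flip_rate(model: Callable[[Example], int],
                           examples: Sequence[Example],
                           perturb: Callable[[Example], Example]) -> float:
    """Fraction of matched pairs on which the model changes its prediction."""
    flips = sum(model(ex) != model(perturb(ex)) for ex in examples)
    return flips / len(examples)


def distance_model(ex: Example) -> int:
    """Toy model that relies only on a spurious cue: entities close together."""
    return int(abs(ex.e1 - ex.e2) <= 4)


def trigger_model(ex: Example) -> int:
    """Toy model that relies on a lexical trigger word instead of distance."""
    return int("binds" in ex.tokens)


EXAMPLES = [
    Example(("ProtA", "binds", "ProtB", "in", "vitro"), 0, 2, 1),
    Example(("ProtA", "and", "ProtB", "were", "purified"), 0, 2, 0),
    Example(("ProtA", "was", "not", "found", "to", "interact", "with", "ProtB"), 0, 7, 0),
]

for name, model in [("distance", distance_model), ("trigger", trigger_model)]:
    rate = matched_pair_flip_rate(model, EXAMPLES, insert_filler)
    print(f"{name} model flip rate: {rate:.2f}")
```

A high flip rate on pairs that differ only in entity distance suggests that the model's decisions depend on that surface cue rather than on the relation itself, which is the kind of internal-validity problem the abstract describes.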

Paper Structure (24 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Internal validity is a mandatory precursor for any form of external generalization, including cross-dataset generalization. Internal validity is required to ensure that the model has learned core linguistic strategies to solve the task within the context of the test set.
  • Figure 2: Decision tree (NB-T) fit to high-confidence predictions in the generalization set
  • Figure 3: Decision tree (NB-T) fit to the train set ground truth
  • Figure 4: Decision tree (NB-T) fit to the test set ground truth
  • Figure 5: Decision tree (NB-T) fit to BioBERT predictions on the test set
  • ...and 2 more figures
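
The decision-tree figures listed above fit an interpretable surrogate either to ground-truth labels or to BioBERT predictions. As a rough illustration of that comparison, the sketch below fits a one-split tree (a decision stump) on a single feature, the entity distance, first to gold labels and then to model predictions. The stump is a simplified stand-in, not the NB-T trees from the paper, and the distances, labels, and predictions are made-up toy values.

```python
from typing import Sequence, Tuple


def fit_stump(distances: Sequence[int],
              targets: Sequence[int]) -> Tuple[int, int, float]:
    """Fit the best single split on entity distance.

    Returns (threshold, near_label, accuracy): the stump predicts near_label
    when distance <= threshold and 1 - near_label otherwise. Ties keep the
    first (smallest) threshold found.
    """
    best = (0, 0, -1.0)
    for threshold in sorted(set(distances)):
        for near_label in (0, 1):
            hits = sum(
                (near_label if d <= threshold else 1 - near_label) == y
                for d, y in zip(distances, targets)
            )
            accuracy = hits / len(targets)
            if accuracy > best[2]:
                best = (threshold, near_label, accuracy)
    return best


# Hypothetical toy data: token distance between the two entities,
# gold labels, and the predictions of some black-box model.
distances = [1, 2, 3, 4, 8, 9, 10, 12]
gold = [1, 1, 0, 1, 0, 1, 0, 0]
model_preds = [1, 1, 1, 1, 0, 0, 0, 0]

gold_fit = fit_stump(distances, gold)
pred_fit = fit_stump(distances, model_preds)
print("surrogate fit to gold labels:      ", gold_fit)
print("surrogate fit to model predictions:", pred_fit)
```

If a distance-only surrogate reproduces the model's predictions much better than it reproduces the gold labels, as in this toy data, that is a hint that the model leans on distance more than the task itself warrants.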