MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Mădălina Zgreabăn; Tejaswini Deoskar; Lasha Abzianidze

TL;DR

This work tackles poor out-of-distribution generalization in natural language inference (NLI) by introducing MERGE, a method that automatically generates minimally altered, reasoning-preserving variants of NLI problems through open-class word replacements guided by masked language models (MLMs). The authors formalize a pipeline that replaces words shared by the premise and hypothesis with contextually plausible alternatives, while enforcing constraints that keep word overlap and syntax intact. They evaluate a broad set of NLI models on a large, variant-rich dataset and find substantial generalization gaps: pattern accuracy (PA) on variants lags well behind standard accuracy on the seed problems, and reaching parity requires substantially lower thresholds, especially as the variant set grows. Analyses indicate that nouns and adjectives pose greater challenges than verbs, that models show only limited bias toward variants originating from particular MLMs, and that the number of unique variants often drives scores more than the filtering criteria do. Overall, MERGE provides a scalable, model-agnostic framework for generalization testing beyond traditional in-distribution benchmarks, and the authors suggest extending it to other NLP tasks and examining refinements such as subword tokenization effects and prompt-based evaluation.
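
The replacement step summarized above can be illustrated with a minimal sketch. It assumes the MLM suggestions and their word classes are already available; the function names, the example sentences, the suggestion list, and the `pos_lookup` table are all hypothetical and are not taken from the paper's implementation. The two exclusion rules follow the caption of Figure 2: drop a suggestion if its word class differs from that of the replaced word, or if it already occurs in the problem.

```python
import re

def filter_suggestions(suggestions, word_class, problem_tokens, pos_lookup):
    """Keep suggestions that share the replaced word's class and are not
    already present in the premise or hypothesis (hypothetical helper)."""
    present = {tok.lower() for tok in problem_tokens}
    kept = []
    for cand in suggestions:
        if pos_lookup.get(cand.lower()) != word_class:
            continue  # different word class than the replaced word
        if cand.lower() in present:
            continue  # already part of the NLI problem
        kept.append(cand)
    return kept

def replace_shared_word(premise, hypothesis, word, replacement):
    """Apply the same replacement in both sentences so word overlap is kept."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return pattern.sub(replacement, premise), pattern.sub(replacement, hypothesis)

# Hypothetical seed problem and hypothetical MLM output for the word "little".
premise = "A little girl is playing outside"
hypothesis = "A little girl is outdoors"
suggestions = ["small", "teen", "girl", "young"]
pos_lookup = {"small": "ADJ", "teen": "NOUN", "girl": "NOUN", "young": "ADJ"}

tokens = premise.split() + hypothesis.split()
kept = filter_suggestions(suggestions, "ADJ", tokens, pos_lookup)
variants = [replace_shared_word(premise, hypothesis, "little", s) for s in kept]
```

In a real pipeline the suggestions would come from one or more MLMs and the word classes from a tagger; here both are hard-coded so that the filtering logic stays visible.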

Abstract

In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.

Paper Structure (34 sections, 1 equation, 11 figures, 9 tables)

Introduction
Related Work
Generalizability in NLI
Modification Strategy
Creation Type
Validation
Meaning, Reasoning, Word Overlap and Syntax
Modified Sentence
Evaluation
Results of previous studies
Shortcomings of previous variant datasets
MERGE
Methodology
Experimental Setup
Suggestion generation
...and 19 more sections

Figures (11)

Figure 1: MERGE vs. standard sample-based evaluations: in sample-based evaluation each variant is an independent example, whereas in MERGE performance is measured as the proportion of correctly classified variants for an NLI seed problem, i.e. whether a model correctly classifies at least x variants (the threshold number) for each NLI problem.
Figure 2: Generating NLI problem variants with MLMs. Suggestions for a word $w_i$ shared between $P$ and $H$ are excluded if their word class differs from that of $w_i$ (teen for little), or if they are already part of the problem (girl).
Figure 3: Averaged Fluency and Reasoning scores with normalized counts for 100 random variants for Nouns, Verbs, and Adjectives. The red lines are the bar plots weighted by the distribution of classes in the seed NLI problems (N=67%, V=23%, ADJ=10%). Good variants have a score of $F+R \geq 9$.
Figure 4: PA scores of models on ALL$_\text{Var}$ from the 80% threshold onward. The red dots are PA scores at the QT of ALL$_\text{Var}$ (90%).
Figure 5: Averaged PA scores on N$_\text{Var}$, V$_\text{Var}$, and A$_\text{Var}$.
...and 6 more figures
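
The contrast drawn in the caption of Figure 1 can be made concrete with a small sketch. It assumes that the threshold is expressed as a fraction of a seed problem's variants, in line with the percentage thresholds mentioned for Figure 4; the function names and the toy results are hypothetical, and the paper's exact definition of pattern accuracy (PA) may differ in details such as how the seed problem itself is counted.

```python
def pattern_accuracy(results, threshold):
    """Share of seed problems for which at least `threshold` (a fraction in
    [0, 1]) of the variants are classified correctly. Assumed reading of PA."""
    if not results:
        return 0.0
    passed = 0
    for flags in results.values():
        if flags and sum(flags) / len(flags) >= threshold:
            passed += 1
    return passed / len(results)

def sample_accuracy(results):
    """Standard sample-based accuracy: every variant counts independently."""
    flat = [flag for flags in results.values() for flag in flags]
    return sum(flat) / len(flat) if flat else 0.0

# Hypothetical per-seed outcomes: True means the variant was classified correctly.
results = {
    "seed_1": [True, True, True, True],
    "seed_2": [True, False, True, True],
    "seed_3": [False, False, True, True],
}

pa_75 = pattern_accuracy(results, 0.75)   # seeds 1 and 2 pass
pa_100 = pattern_accuracy(results, 1.0)   # only seed 1 passes
acc = sample_accuracy(results)            # 9 of 12 variants correct
```

With these toy numbers, sample accuracy stays at 0.75 while PA drops from 2/3 to 1/3 as the threshold rises from 75% to 100%, which is the kind of gap a per-seed evaluation is meant to expose.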