MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze
TL;DR
This work tackles poor out-of-distribution generalization in natural language inference (NLI) by introducing MERGE, a method that automatically generates minimally altered, reasoning-preserving variants of NLI problems through open-class word replacements suggested by masked language models (MLMs). The authors formalize a pipeline that replaces words shared by the premise and hypothesis with contextually plausible alternatives, while enforcing constraints that keep word overlap and syntax intact. They evaluate a broad set of NLI models on a large, variant-rich dataset and find substantial generalization gaps: pattern accuracy on variants lags far behind standard accuracy on the seed problems, and the two reach parity only when the pattern-accuracy threshold is lowered substantially, especially as the variant set grows. Analyses indicate that replacing nouns and adjectives poses a greater challenge than replacing verbs, that models show little bias toward variants originating from any particular MLM, and that the number of unique variants often affects scores more than the filtering criteria do. Overall, MERGE provides a scalable, model-agnostic framework for testing generalization beyond traditional in-distribution benchmarks. The authors suggest extending it to other NLP tasks and refining it in future work, for example by studying subword tokenization effects and prompt-based evaluation.
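To make the notion of a thresholded pattern accuracy concrete, here is a minimal Python sketch. It assumes one plausible reading of the summary above: a seed problem counts as solved when the share of its variants that the model labels correctly reaches a given threshold. The function name `pattern_accuracy`, the input layout, and the toy outcomes are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pattern_accuracy(variant_correct: Dict[str, List[bool]], threshold: float) -> float:
    """Share of seed problems solved at the given threshold.

    Assumption (not the paper's stated definition): a seed problem is solved
    when the fraction of its variants predicted correctly is >= threshold.
    """
    if not variant_correct:
        raise ValueError("need at least one seed problem")
    solved = 0
    for outcomes in variant_correct.values():
        if not outcomes:
            continue  # a seed without variants is counted as unsolved
        if sum(outcomes) / len(outcomes) >= threshold:
            solved += 1
    return solved / len(variant_correct)


# Toy, made-up outcomes: one boolean per variant of each seed problem.
results = {
    "seed-1": [True, True, True, True],
    "seed-2": [True, True, False, True],
    "seed-3": [False, True, False, False],
}
strict = pattern_accuracy(results, threshold=1.0)    # every variant must be correct
lenient = pattern_accuracy(results, threshold=0.75)  # 75% of the variants suffice
```

On this toy input, only one of the three seeds is solved under the strict threshold, while two are solved under the lenient one, which illustrates why lowering the threshold narrows the gap to seed accuracy.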
Abstract
In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization); it evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
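The core replacement step can be illustrated with a short Python sketch. It assumes that a word is replaced only if it occurs in both the premise and the hypothesis, and that the same substitute is applied to both sentences so that word overlap is preserved. In the method described above the substitutes are suggested by masked language models; here they come from a hand-written, hypothetical `candidates` dictionary, and the function `make_variants` and the example sentences are illustrative, not taken from the paper.

```python
import re
from typing import Dict, List, Tuple


def make_variants(
    premise: str, hypothesis: str, candidates: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Build variants by swapping one shared word in both sentences at once.

    `candidates` maps a word to its substitutes. Hypothetical stand-in for
    suggestions from a masked language model.
    """
    variants = []
    for word, substitutes in candidates.items():
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        # Only words present in BOTH sentences are eligible for replacement.
        if not (pattern.search(premise) and pattern.search(hypothesis)):
            continue
        for substitute in substitutes:
            if substitute == word:
                continue  # skip identity replacements
            variants.append(
                (pattern.sub(substitute, premise), pattern.sub(substitute, hypothesis))
            )
    return variants


premise = "A man is playing a guitar on the street."
hypothesis = "A man is playing an instrument."
candidates = {"man": ["woman", "boy"], "guitar": ["piano"]}  # made-up suggestions

variants = make_variants(premise, hypothesis, candidates)
for variant_premise, variant_hypothesis in variants:
    print(variant_premise, "=>", variant_hypothesis)
```

Here "guitar" is skipped because it does not occur in the hypothesis, while each substitute for "man" yields a variant in which the premise and hypothesis still share the replaced word. A full pipeline would additionally filter substitutes, for instance by word class and plausibility, which this sketch leaves out.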
