Estimating the Causal Effects of Natural Logic Features in Neural NLI Models
Julia Rozanova, Marco Valentino, Andre Freitas
TL;DR
This work introduces a causal intervention framework to quantify how semantic features, specifically context monotonicity and word-pair relations, drive NLI model predictions. By constructing a causal diagram for the NLI-XY subtask and estimating total causal effects (TCE) and undesired direct causal effects (DCE) from interventional data, the authors assess model robustness to surface-form changes and sensitivity to semantic changes. They compare several models, including variants fine-tuned on HELP, and find that high benchmark accuracy does not guarantee strong causal robustness. HELP fine-tuning generally improves sensitivity to context and reduces reliance on lexical cues, although it can trade off against performance on standard benchmarks. The findings advance interpretability in NLP by linking causal influence to model behavior, and they can inform interventions that strengthen desired reasoning patterns and reduce brittle reliance on superficial cues.
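The contrast between sensitivity (effects that should be large) and robustness (effects that should be near zero) can be illustrated with a small sketch. It assumes a hypothetical data format in which each intervention yields a pair of model probabilities for the entailment class, one on the original NLI-XY example and one on its intervened counterpart, and it uses the mean absolute shift in that probability as a simplified stand-in loosely mirroring the TCE/DCE distinction. The function names and the estimator are illustrative assumptions, not the authors' method.

```python
from statistics import mean

# Hypothetical data format (assumption): each pair holds a model's
# P(entailment) on an original example and on its intervened counterpart.
Pair = tuple[float, float]


def mean_abs_change(pairs: list[Pair]) -> float:
    """Average absolute shift in P(entailment) caused by an intervention."""
    if not pairs:
        raise ValueError("need at least one (original, intervened) pair")
    return mean(abs(after - before) for before, after in pairs)


def effect_profile(label_changing: list[Pair],
                   label_preserving: list[Pair]) -> dict[str, float]:
    """Summarise sensitivity and robustness for one model.

    label_changing:   interventions that alter the gold label (e.g. swapping an
                      upward-monotone context for a downward-monotone one);
                      a large value indicates sensitivity to the change.
    label_preserving: interventions that leave the gold label unchanged (e.g.
                      swapping a word pair for another with the same relation);
                      any shift here is undesired, so smaller is better.
    """
    return {
        "sensitivity": mean_abs_change(label_changing),
        "undesired_effect": mean_abs_change(label_preserving),
    }


if __name__ == "__main__":
    # Toy probabilities for illustration only, not results from the paper.
    profile = effect_profile(
        label_changing=[(0.9, 0.2), (0.8, 0.3)],
        label_preserving=[(0.9, 0.85), (0.7, 0.75)],
    )
    print(profile)
```

The absolute value is used so that label flips in opposite directions (entailment to non-entailment and vice versa) do not cancel out when averaged.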
Abstract
Rigorous evaluation of the causal effects of semantic features on language model predictions can be hard to achieve for natural language reasoning problems. However, this form of analysis is so desirable from both an interpretability and a model evaluation perspective that it is valuable to zoom in on specific patterns of reasoning with enough structure and regularity to identify and quantify systematic reasoning failures in widely used models. In this vein, we pick a portion of the NLI task for which an explicit causal diagram can be systematically constructed: the case where two related words/terms occur in a shared context across the two sentences (the premise and the hypothesis). In this work, we apply causal effect estimation strategies to measure the effect of context interventions (whose effect on the entailment label is mediated by the semantic monotonicity characteristic) and of interventions on the inserted word pair (whose effect on the entailment label is mediated by the relation between these words). Following related work on causal analysis of NLP models in different settings, we adapt the methodology to the NLI task to construct comparative model profiles in terms of robustness to irrelevant changes and sensitivity to impactful changes.
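To make the causal structure concrete, here is a minimal sketch of how the gold label in this setting follows from the two mediators, using standard natural-logic reasoning. It assumes a binary entailment/non-entailment label, a single insertion slot marked `[X]`, and three relation values between the inserted words x (premise) and y (hypothesis); the names `Monotonicity`, `Relation`, `gold_label`, and `build_pair` are hypothetical and not taken from the authors' code.

```python
from enum import Enum


class Monotonicity(Enum):
    UP = "upward"      # e.g. "Some [X] are sleeping."
    DOWN = "downward"  # e.g. "No [X] are sleeping."


class Relation(Enum):
    FORWARD = "x is more specific than y"  # e.g. (dogs, animals)
    REVERSE = "x is more general than y"   # e.g. (animals, dogs)
    NONE = "x and y unrelated"


def gold_label(mon: Monotonicity, rel: Relation) -> str:
    """Natural-logic label for premise = context[x], hypothesis = context[y]."""
    # Upward contexts preserve specific-to-general inferences;
    # downward contexts preserve general-to-specific ones.
    if (mon, rel) in {(Monotonicity.UP, Relation.FORWARD),
                      (Monotonicity.DOWN, Relation.REVERSE)}:
        return "entailment"
    return "non-entailment"


def build_pair(context: str, x: str, y: str,
               slot: str = "[X]") -> tuple[str, str]:
    """Insert x into the premise and y into the hypothesis at the shared slot."""
    return context.replace(slot, x), context.replace(slot, y)


if __name__ == "__main__":
    print(build_pair("Some [X] are sleeping.", "dogs", "animals"),
          gold_label(Monotonicity.UP, Relation.FORWARD))
    # Context intervention: replacing upward-monotone "Some" with
    # downward-monotone "No" changes the label for the same word pair.
    print(build_pair("No [X] are sleeping.", "dogs", "animals"),
          gold_label(Monotonicity.DOWN, Relation.FORWARD))
```

Under this view, a context intervention can change the label only through monotonicity, and a word-pair intervention only through the relation; swapping (dogs, animals) for (cats, mammals) keeps the relation and therefore the label, so a model whose prediction shifts under such a swap is reacting to an irrelevant change.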
