NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals
Neha Srikanth, Rachel Rudinger
TL;DR
This work introduces atomic decomposition of hypotheses into atomic sub-problems for both traditional NLI and defeasible NLI, enabling granular inspection of inferences and model consistency. It develops a pipeline to generate, prune, and validate atoms (via pruning and human validation) and applies it to SNLI and δ-SNLI, revealing that high overall accuracy can mask inconsistencies at the atomic level. The paper further defines critical atoms and a QUD framework to identify the most influential pieces of information driving defeasible updates, and introduces inferential consistency ($I_C$) to measure cross-context reliability of predictions. Empirically, six language models show varying levels of atomic and inferential consistency, with critical-atom sub-problems yielding stronger signals than full examples, highlighting both the promise and elusiveness of robust, context-aware reasoning in current models. The findings have implications for dataset design, annotation strategies, and evaluation protocols aimed at diagnosing and improving non-monotonic reasoning in NLP systems.
Abstract
Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
