NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals

Neha Srikanth, Rachel Rudinger

TL;DR

This work decomposes hypotheses into atomic propositions ("atoms") to form atomic sub-problems for both traditional NLI and defeasible NLI, enabling granular inspection of inferences and model consistency. It develops a pipeline to generate, prune, and human-validate atoms and applies it to SNLI and δ-SNLI, revealing that high overall accuracy can mask inconsistencies at the atomic level. The paper further defines critical atoms and a question-under-discussion (QUD) framing to identify the pieces of information that most strongly drive defeasible updates, and introduces inferential consistency ($I_C$) to measure how reliably a model's predictions about the same fact hold across contexts. Empirically, six language models show varying levels of atomic and inferential consistency, with critical-atom sub-problems yielding stronger signals than full examples, highlighting both the promise and the elusiveness of robust, context-aware reasoning in current models. The findings have implications for dataset design, annotation strategies, and evaluation protocols aimed at diagnosing and improving non-monotonic reasoning in NLP systems.

Abstract

Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
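
To make the notion of an atomic sub-problem concrete, here is a minimal Python sketch of how a premise, a list of atoms, and an optional update could be paired into sub-problems. The `SubProblem` class, the `make_subproblems` helper, and the example sentences are all hypothetical illustrations and are not taken from the paper or its data; the atoms are assumed to come from some upstream decomposition step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SubProblem:
    """One atomic sub-problem (hypothetical container, not the paper's code)."""
    premise: str
    atom: str
    update: Optional[str] = None  # None for traditional NLI, set for defeasible NLI


def make_subproblems(premise: str, atoms: List[str],
                     update: Optional[str] = None) -> List[SubProblem]:
    """Pair the premise (and optional update) with each atom of the hypothesis."""
    return [SubProblem(premise, atom, update) for atom in atoms]


# Made-up example: atoms of the hypothesis
# "A musician is performing outdoors for a crowd."
premise = "A man in a red shirt plays guitar on a street corner."
atoms = [
    "There is a musician.",
    "The musician is performing.",
    "The performance is outdoors.",
    "There is a crowd.",
]

nli_subproblems = make_subproblems(premise, atoms)
defeasible_subproblems = make_subproblems(
    premise, atoms, update="People have stopped to watch him."
)
```

Each sub-problem would then be labeled or passed to a model on its own, so that sub-problem predictions can be compared against the prediction for the full example.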

Paper Structure

This paper contains 44 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Top: Atomic hypothesis decomposition breaks down hypotheses ($H$) into entailed propositional "atoms" ($a_1$ to $a_3$). Middle: Pairing the premise ($P$) with each atom yields a set of NLI sub-problems ($P+a$); the sub-problem labels predict the full NLI problem ($P+H$) label. Bottom: Paired with an update ($U$), each atom yields a defeasible NLI sub-problem ($P+a+U$); the set of sub-problem labels is predictive of the full problem ($P+H+U$) label, but the non-monotonic relationship is more complex than for traditional NLI.
  • Figure 2: A rug plot visualization of 1,761 $\delta$-SNLI instances and their corresponding distribution of atomic sub-problem labels. Each vertical slice represents one full $\delta$-SNLI instance. Slice color (red or green) represents the full instance label (weakener or strengthener). For each $\delta$-SNLI problem, we manually label each corresponding atomic sub-problem on a -2 (strongly weakens) to +2 (strongly strengthens) scale. Each vertical slice uses shading (light/dark) to represent the resulting distribution of atomic sub-problem labels (-2 to +2). Slices are ordered left to right by proportion of weakener labels, showing relatively high separation between red and green instances. When atomic sub-problems contain a mix of positive and negative labels, the full problem label may be a strengthener or a weakener, as illustrated by the two center-most exemplars.
  • Figure 3: Updates ($U$) may act on the same hypothesis $H$ in different ways by targeting different atoms. Here, each $U$ strongly targets a different atom, while having no effect on the other atoms derived from $H$ (e.g. the $U$ in the first row has no effect on $a_3$ in the last row). We refer to the atom(s) which an update most strongly affects as the "critical" atom of the $(P, H, U)$ $\delta$-NLI example. Critical atoms help identify the question under discussion of the example.
  • Figure 4: Grouping examples by their critical atom(s) allows us to understand under which contexts ($P + U$) a model has understood a piece of knowledge. Here, we show two $\delta$-NLI examples that evaluate the same atom (top): one that strengthens it (left), and one that weakens it (right). A model that truly understands a fact and the factors that influence it (or, conversely, does not) should yield consistently correct or consistently incorrect predictions. However, some models have mixed accuracy among examples targeting the same atom, indicating that they only understand the inference under some contexts.
  • Figure 5: Distribution of fine-grained labels across all atoms in $\delta$-SNLI-test.
  • ...and 2 more figures
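
The grouping idea in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustrative proxy under stated assumptions, not the paper's definition of $I_C$ (which is not reproduced on this page): it assumes each example has already been assigned a critical atom, groups model correctness by that atom, and reports the fraction of atoms on which the model is either always correct or always incorrect. The function names, the `min_contexts` parameter, and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_atom(records: Iterable[Tuple[str, bool]]) -> Dict[str, List[bool]]:
    """Collect per-example correctness flags under each critical atom."""
    groups: Dict[str, List[bool]] = defaultdict(list)
    for atom, is_correct in records:
        groups[atom].append(is_correct)
    return dict(groups)


def consistent_fraction(records: Iterable[Tuple[str, bool]],
                        min_contexts: int = 2) -> float:
    """Fraction of atoms with uniformly correct or uniformly incorrect predictions.

    Assumption: atoms seen in fewer than `min_contexts` examples are skipped,
    because consistency across contexts is undefined for a single context.
    """
    groups = group_by_atom(records)
    eligible = [flags for flags in groups.values() if len(flags) >= min_contexts]
    if not eligible:
        return float("nan")
    consistent = sum(1 for flags in eligible if all(flags) or not any(flags))
    return consistent / len(eligible)


# Toy records: (critical atom id, was the model's prediction correct?)
records = [
    ("atom_a", True), ("atom_a", True),    # consistently correct
    ("atom_b", True), ("atom_b", False),   # mixed: only understood in some contexts
    ("atom_c", False), ("atom_c", False),  # consistently incorrect
    ("atom_d", True),                      # single context, skipped
]
score = consistent_fraction(records)  # 2 of the 3 eligible atoms are consistent
```

Counting uniformly incorrect groups as consistent follows the caption's framing that a model which does not understand a fact should fail on it across contexts; a stricter variant could report the always-correct and always-incorrect fractions separately.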