Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

Nathaniel Weir; Kate Sanders; Orion Weller; Shreya Sharma; Dongwei Jiang; Zhengping Jiang; Bhavana Dalvi Mishra; Oyvind Tafjord; Peter Jansen; Peter Clark; Benjamin Van Durme

TL;DR

This work introduces RDTE, a principled, informal-logic–driven protocol for annotating decompositional textual entailment, addressing inconsistencies in prior datasets and the need for reliable reasoning steps in entailment trees. It presents a high-quality RDTE dataset and a knowledge-distillation pipeline that uses GPT-4 to generate silver RDTE annotations, enabling smaller models to achieve strong precision in decomposition validation. Building on this, the authors introduce TreeWise, an entailment-tree engine that integrates backward chaining with forward inference, diverse prompts, and RDTE-based verification to ground hypotheses in verified corpora such as Wikipedia. Across ARC and HotpotQA, TreeWise with RDTE distillation yields superior QA accuracy and higher tree integrity, highlighting the practical impact of rigorous reasoning protocols for trustworthy NL inference.

Abstract

Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.

Paper Structure (33 sections, 16 figures, 3 tables)

This paper contains 33 sections, 16 figures, 3 tables.

Introduction
Decompositional RTE
Task Definition
Background: RAS Criteria
Implementing RAS for RTE Annotation
Data Collection
Annotation Process
RDTE Analysis
RDTE Evaluation
RDTE Results
TreeWise
TreeWise Experiments
QA Evaluation
Related Work
Conclusion
...and 18 more sections

Figures (16)

Figure 1: (Upper) Two hypothesis decompositions suggested by an LLM. The first makes an argument that is generally acceptable to a human. The second contains a fact that is not always true and another that is irrelevant to the entailment. Recognizing such an invalid decomposition is core to recent neuro-symbolic reasoning algorithms, but LLMs struggle at the task. (Lower) Ambiguous definitions of entailment have hampered progress in annotating data to improve the models. We find that a faceted definition yields both a clean dataset (RDTE) and significant downstream task improvements.
Figure 2: Distribution of the 1000 entailment labels in RDTE. Instead of binary entail/non-entailment, we annotate on a 5-point ordinal scale. To evaluate binary judgment models, we treat $\geq$4 as positively labeled.
Figure 3: Example RDTE annotations.
Figure 4: TreeWise generates many premise decompositions of a hypothesis and checks whether any candidates are valid entailments. Premises are then recursively decomposed until TreeWise finds one or more trees fully grounded in one or more documents from a corpus like Wikipedia. Statements entailed by documents are generated via forward chaining, while the rest of the search is backward. Many decompositions end up untraversed due to the search budget or nonentailment.
Figure 5: RDTE annotation guidelines for premise-specific qualia in ARC.
...and 11 more figures
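
The Figure 2 caption states that RDTE labels are on a 5-point ordinal scale and that scores of 4 or higher are treated as positive when evaluating binary judgment models. A minimal sketch of that conversion, together with a standard precision computation, might look as follows. The function names, the input validation, and the convention of returning 0.0 when a model predicts no positives are illustrative assumptions, not details taken from the paper.

```python
from typing import List

# Threshold taken from the Figure 2 caption: scores >= 4 count as entailment.
POSITIVE_THRESHOLD = 4


def binarize(scores: List[int], threshold: int = POSITIVE_THRESHOLD) -> List[bool]:
    """Map 5-point ordinal entailment scores to binary labels (hypothetical helper)."""
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 ordinal scale")
    return [score >= threshold for score in scores]


def precision(gold: List[bool], predicted: List[bool]) -> float:
    """Fraction of predicted positives that are gold positives.

    Returns 0.0 when nothing is predicted positive (an assumed convention).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    predicted_positive = sum(predicted)
    if predicted_positive == 0:
        return 0.0
    true_positive = sum(g and p for g, p in zip(gold, predicted))
    return true_positive / predicted_positive


if __name__ == "__main__":
    gold_labels = binarize([1, 2, 3, 4, 5])   # made-up scores, for illustration only
    model_output = [False, True, False, True, True]
    print(gold_labels)
    print(precision(gold_labels, model_output))
```

Keeping the threshold as a named constant makes it easy to check how sensitive an evaluation is to the cut-off, for example by also reporting results with a threshold of 5.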
