Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

Oyvind Tafjord; Bhavana Dalvi Mishra; Peter Clark

TL;DR

Entailer introduces a QA framework that makes its reasoning explicit by generating and self-verifying backward-chaining entailment trees that reflect the system's internal beliefs. It combines hypothesis generation with multi-step proofs, trained on EntailmentBank and crowdsourced data, and operates zero-shot on new tasks. Empirical results show proof-based QA achieves accuracy comparable to direct QA while providing faithful and truthful reasoning traces, with human judges favoring the clarity and reliability of these chains. The work also outlines a path toward teachable, interactive systems where user corrections can modify the model's beliefs and proofs over time.

Abstract

Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach is to recursively combine a trained backward-chaining model, capable of generating a set of premises entailing an answer hypothesis, with a verifier that checks that the model itself believes those premises (and the entailment itself) through self-querying. To our knowledge, this is the first system to generate multistep chains that are both faithful (the answer follows from the reasoning) and truthful (the chain reflects the system's own internal beliefs). In evaluation using two different datasets, users judge that a majority (70%+) of generated chains clearly show how an answer follows from a set of facts - substantially better than a high-performance baseline - while preserving answer accuracy. By materializing model beliefs that systematically support an answer, new opportunities arise for understanding the model's system of belief, and diagnosing and correcting its misunderstandings when an answer is wrong.
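The generate-then-verify step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `generate`, `believes`, and `entails` are hypothetical stand-ins for the trained backward-chaining model and the self-querying verifier, and both the 0.5 acceptance threshold and the product used to rank candidates are assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Premises = Tuple[str, ...]


def best_verified_step(
    hypothesis: str,
    generate: Callable[[str], List[Premises]],      # hypothetical: proposes premise sets
    believes: Callable[[str], float],               # hypothetical: self-query, "is this true?"
    entails: Callable[[Premises, str], float],      # hypothetical: self-query, "does this follow?"
    threshold: float = 0.5,                         # assumed cut-off, not from the paper
) -> Optional[Tuple[Premises, float]]:
    """Return the best premise set that the model itself accepts, or None."""
    best: Optional[Tuple[Premises, float]] = None
    for premises in generate(hypothesis):
        premise_scores = [believes(p) for p in premises]
        entail_score = entails(premises, hypothesis)
        # Drop any candidate containing a premise or an entailment
        # that the model does not itself believe.
        if min(premise_scores + [entail_score]) < threshold:
            continue
        # Assumed ranking: product of the entailment and premise scores.
        score = entail_score
        for s in premise_scores:
            score *= s
        if best is None or score > best[1]:
            best = (premises, score)
    return best
```

With toy stand-ins in which the model rejects "Magnets attract all metals." but accepts "A magnet attracts iron." and "A nail is made of iron.", the function discards the first candidate and returns the second, which is the over-generate, filter, and select pattern the abstract describes.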

Paper Structure (24 sections, 8 figures, 1 table)

Introduction
Related Work
Approach
Hypothesis Generation
Generating Entailment Trees
Generating a Backward-Chaining Step
Models:
Algorithm:
Backward Chaining
Model Training
Data Sources
EntailmentBank
Crowdsourced Data
Optional Fields
Model Details
...and 9 more sections

Figures (8)

Figure 1: Given a question, Entailer searches for an answer hypothesis that is supported by an entailment proof. First it over-generates candidate proofs, then it removes those that the model itself does not “believe” (i.e., confirms via self-querying that it considers all the generated proof elements to be true). Finally it selects the best verified proof. Multistep proofs are generated by iteratively backward chaining on the premises (see the section "Generating Entailment Trees").
Figure 2: Questions (from the OBQA dataset) and Entailer's answers, showing its chain of reasoning.
Figure 3: An entailment tree is a set of multi-premise, 1-step entailments (red boxes) showing how the hypothesis (root node, green) is entailed from the leaf nodes (white). If all the leaf nodes are true, and all the 1-step entailment relations are valid, then we say the tree is a valid chain of reasoning for the hypothesis.
Figure 4: The entailment tree is grown recursively, with the algorithm searching for the best tree (the one that maximizes the overall score of the root node). Each node has a fixed, direct (“fast”) score (in red), and (for internal nodes) a proof (“slow”) score (in blue) computed from its children. The overall node score (highlighted) is the higher of the two. If expanding a node increases its overall score (e.g., step 3), that increase is propagated upwards and recursion continues. If expansions cannot improve a node’s score further (e.g., steps 2 and 4), the expansions are pruned and that node becomes a leaf (red bars).
Figure 5: QA accuracy of Direct QA, Entailer, and the two combined on two datasets.
...and 3 more figures
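The scoring scheme in the Figure 4 caption can be illustrated with a short recursive sketch. It is a simplification under stated assumptions: `direct` and `expand` are hypothetical callables standing in for the model's direct ("fast") scoring and its backward-chaining step, the depth limit is arbitrary, and the proof ("slow") score is assumed to be the product of the entailment score and the children's overall scores, since the caption says only that it is "computed from its children".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Step = Tuple[Tuple[str, ...], float]  # (premises, entailment score)


@dataclass
class Node:
    statement: str
    direct_score: float                      # fixed "fast" score
    score: float = 0.0                       # overall score: max(direct, proof)
    children: List["Node"] = field(default_factory=list)


def grow(
    statement: str,
    direct: Callable[[str], float],               # hypothetical direct scorer
    expand: Callable[[str], Optional[Step]],      # hypothetical one-step backward chainer
    max_depth: int = 3,                           # assumed recursion limit
) -> Node:
    """Grow a tree below `statement`, keeping only expansions that raise its score."""
    node = Node(statement, direct(statement))
    node.score = node.direct_score
    if max_depth == 0:
        return node
    step = expand(statement)
    if step is None:
        return node
    premises, entail_score = step
    children = [grow(p, direct, expand, max_depth - 1) for p in premises]
    # Assumed combination rule: entailment score times the children's overall scores.
    proof_score = entail_score
    for child in children:
        proof_score *= child.score
    if proof_score > node.direct_score:
        # The expansion helps: keep it, and the higher score reaches the parent.
        node.children = children
        node.score = proof_score
    # Otherwise the expansion is pruned and the node stays a leaf.
    return node
```

Because each child's overall score already reflects its own best expansion, an improvement deep in the tree is carried up to the root through the returned `score` values, while an expansion that does not beat a node's direct score is discarded.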
