Table of Contents
Fetching ...

PRover: Proof Generation for Interpretable Reasoning over Rules

Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, Mohit Bansal

TL;DR

PRover tackles interpretable reasoning over natural-language rule-bases by jointly predicting binary answers and explicit proof graphs. It combines a RoBERTa-based QA module with dedicated Node and Edge components and enforces global constraints via ILP to generate valid proofs. The approach achieves QA performance on par with RuleTakers, reaches 87% proof accuracy, and shows strong zero-shot and depth-generalization, including near-perfect QA with limited data. This work provides a scalable, explainable framework for multi-hop reasoning in fuzzy-rule domains and opens avenues for integrating structured proofs into NLP tasks.

Abstract

Recent work by Clark et al. (2020) shows that transformers can act as 'soft theorem provers' by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers, by proposing PROVER, an interpretable transformer-based model that jointly answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. During inference, a valid proof, satisfying a set of global constraints is generated. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation, with strong generalization performance. First, PROVER generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation). Second, when trained on questions requiring lower depths of reasoning, it generalizes significantly better to higher depths (up to 15% improvement). Third, PROVER obtains near perfect QA accuracy of 98% using only 40% of the training data. However, generating proofs for questions requiring higher depths of reasoning becomes challenging, and the accuracy drops to 65% for 'depth 5', indicating significant scope for future work. Our code and models are publicly available at https://github.com/swarnaHub/PRover

PRover: Proof Generation for Interpretable Reasoning over Rules

TL;DR

PRover tackles interpretable reasoning over natural-language rule-bases by jointly predicting binary answers and explicit proof graphs. It combines a RoBERTa-based QA module with dedicated Node and Edge components and enforces global constraints via ILP to generate valid proofs. The approach achieves QA performance on par with RuleTakers, reaches 87% proof accuracy, and shows strong zero-shot and depth-generalization, including near-perfect QA with limited data. This work provides a scalable, explainable framework for multi-hop reasoning in fuzzy-rule domains and opens avenues for integrating structured proofs into NLP tasks.

Abstract

Recent work by Clark et al. (2020) shows that transformers can act as 'soft theorem provers' by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers, by proposing PROVER, an interpretable transformer-based model that jointly answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. During inference, a valid proof, satisfying a set of global constraints is generated. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation, with strong generalization performance. First, PROVER generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation). Second, when trained on questions requiring lower depths of reasoning, it generalizes significantly better to higher depths (up to 15% improvement). Third, PROVER obtains near perfect QA accuracy of 98% using only 40% of the training data. However, generating proofs for questions requiring higher depths of reasoning becomes challenging, and the accuracy drops to 65% for 'depth 5', indicating significant scope for future work. Our code and models are publicly available at https://github.com/swarnaHub/PRover

Paper Structure

This paper contains 45 sections, 12 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Block diagram showing that PRover is a closer linguistic analog of formal reasoning.
  • Figure 2: Diagram showing two rule-bases with rules, facts, questions, answers and proofs. PRover answers all the questions correctly and also generates all the corresponding proofs accurately in the above scenarios.
  • Figure 3: Architecture diagram of PRover. The presence and absence of nodes/edges are labeled by 1 and 0 respectively while -100 represents masked out edges.
  • Figure 4: QA performance comparison between PRover and RuleTakers with models trained on DU0, DU1, DU2 and DU3 and tested on DU5.
  • Figure 5: Examples of proofs generated by PRover for four questions on two rule-bases about electric circuits and birds from the Birds-Electricity dataset. PRover not only answers the questions correctly but also accurately predicts the long reasoning chains with multiple branches.
  • ...and 1 more figures