Table of Contents
Fetching ...

LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers

Theo X. Olausson, Alex Gu, Benjamin Lipkin, Cedegao E. Zhang, Armando Solar-Lezama, Joshua B. Tenenbaum, Roger Levy

TL;DR

LINC reframes logical reasoning as a neurosymbolic task by using an LLM as a semantic parser to translate natural-language premises and conclusions into first-order logic, which is then solved by the Prover9 theorem prover. A K-way majority vote over multiple NL-to-FOL translations stabilizes outcomes, enabling significant performance gains over end-to-end prompting baselines on the FOLIO and ProofWriter datasets across several models. The study reveals that LINC and Chain-of-Thought have complementary failure modes, suggesting potential synergy for robust reasoning systems that combine neural and symbolic methods. Overall, the work provides compelling evidence for neurosymbolic reasoning as a scalable approach to logical inference in NLP, with publicly available code for replication and extension.

Abstract

Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc

LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers

TL;DR

LINC reframes logical reasoning as a neurosymbolic task by using an LLM as a semantic parser to translate natural-language premises and conclusions into first-order logic, which is then solved by the Prover9 theorem prover. A K-way majority vote over multiple NL-to-FOL translations stabilizes outcomes, enabling significant performance gains over end-to-end prompting baselines on the FOLIO and ProofWriter datasets across several models. The study reveals that LINC and Chain-of-Thought have complementary failure modes, suggesting potential synergy for robust reasoning systems that combine neural and symbolic methods. Overall, the work provides compelling evidence for neurosymbolic reasoning as a scalable approach to logical inference in NLP, with publicly available code for replication and extension.

Abstract

Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc
Paper Structure (27 sections, 1 equation, 7 figures)

This paper contains 27 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: This figure showcases the essence of our approach. Starting from a problem in natural language, in Step 1, the LLM semantic parser samples logic formulas expressing estimates of the semantics. It is possible that some of these might contain errors, e.g., the second example shows a syntax error involving an extra parenthesis, whereas the fourth example highlights a semantic error caused by mismatched predicates. In Step 2, these are then each offloaded to an automated theorem prover, filtering out syntax errors, and producing labels for the remaining samples. In Step 3, the remaining candidate outputs are passed through a majority-vote sieve to arrive at the best estimate for a single output label.
  • Figure 2: This figure outlines the string concatenation workflow for each of our conditions. We start with the original problem, provide ICL examples through an intermediate markup language, and finally append the problem to evaluate. At this stage, we allow the model to autoregressively sample until producing a stop token.
  • Figure 3: Results of each model on the FOLIO and ProofWriter datasets. Accuracies are for bootstrapped $10$-way majority vote for all models. Error bars are $\pm 1$ bootstrapped standard deviation. Dotted, black line is the accuracy obtained by always guessing the most common label in the dataset.
  • Figure 4: Accuracy per necessary proof depth in ProofWriter. Accuracies reported are for bootstrapped $10$-way majority vote, and shaded areas cover $\pm 1$ bootstrapped standard deviation. Black, dotted lines reflect the expected success rate of guessing a random label, which is 1/3 in all subsets per our experiment design.
  • Figure 5: Analyzing and comparing the mistakes made by GPT-4 on the FOLIO dataset.
  • ...and 2 more figures