Table of Contents
Fetching ...

Certified Deductive Reasoning with Language Models

Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. Goodman

TL;DR

This work introduces LogicGuide, a general tool-guided framework enabling language models to produce certifiably sound reasoning by interfacing with an external formal prover (Peano). By constraining the model's next steps to a finite set of valid inferences, LogicGuide dramatically improves multi-step reasoning accuracy and reduces content effects across several datasets, including PrOntoQA, ProofWriter, and Syllogism Validity, and extends to deontic reasoning via DeontiQA. The authors further show that self-improvement is possible when models are trained on certified, LogicGuide-derived traces (STaR), achieving strong gains on PrOntoQA and ReClor, even when LogicGuide is not used at inference time. Together, these results demonstrate a practical path toward integrating reliable symbolic reasoning with flexible neural models, with implications for robust AI decision-making and bootstrapping reasoning capabilities. The work highlights both the performance benefits and the safety/verification advantages of certifiable reasoning in language models.

Abstract

Language models often achieve higher accuracy when reasoning step-by-step in complex tasks. However, even when arriving at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed, such when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called \emph{guides}, that use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model's choices can change the guide's state. We show how a general system for logical reasoning can be used as a guide, which we call \textsc{LogicGuide}. Given a reasoning problem in natural language, a model can formalize its assumptions for \textsc{LogicGuide} and guarantee that its step-by-step reasoning is sound. In experiments on PrOntoQA, ProofWriter and Syllogism Validity datasets, \textsc{LogicGuide} significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains up to 35\%), while drastically reducing \emph{content effects} -- the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.

Certified Deductive Reasoning with Language Models

TL;DR

This work introduces LogicGuide, a general tool-guided framework enabling language models to produce certifiably sound reasoning by interfacing with an external formal prover (Peano). By constraining the model's next steps to a finite set of valid inferences, LogicGuide dramatically improves multi-step reasoning accuracy and reduces content effects across several datasets, including PrOntoQA, ProofWriter, and Syllogism Validity, and extends to deontic reasoning via DeontiQA. The authors further show that self-improvement is possible when models are trained on certified, LogicGuide-derived traces (STaR), achieving strong gains on PrOntoQA and ReClor, even when LogicGuide is not used at inference time. Together, these results demonstrate a practical path toward integrating reliable symbolic reasoning with flexible neural models, with implications for robust AI decision-making and bootstrapping reasoning capabilities. The work highlights both the performance benefits and the safety/verification advantages of certifiable reasoning in language models.

Abstract

Language models often achieve higher accuracy when reasoning step-by-step in complex tasks. However, even when arriving at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed, such when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called \emph{guides}, that use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model's choices can change the guide's state. We show how a general system for logical reasoning can be used as a guide, which we call \textsc{LogicGuide}. Given a reasoning problem in natural language, a model can formalize its assumptions for \textsc{LogicGuide} and guarantee that its step-by-step reasoning is sound. In experiments on PrOntoQA, ProofWriter and Syllogism Validity datasets, \textsc{LogicGuide} significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains up to 35\%), while drastically reducing \emph{content effects} -- the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.
Paper Structure (26 sections, 19 figures)

This paper contains 26 sections, 19 figures.

Figures (19)

  • Figure 1: A language model can invoke a guide tool, such as our LogicGuide, to perform certifiable generations. Here, when the model decides to generate an infer block, it is constrained to generate one of the formal deductions established by an external theorem-proving environment.
  • Figure 2: Example solution of gpt3.5-turbo using LogicGuide in a problem from ProofWriter. The model's generation starts at the "Formalized context". This example shows all 6 actions we implement in the LogicGuide: object declares a particular entity, prop and relation mark unary and binary predicates, respectively; axiom denotes propositions assumed to hold (possibly implications), goal sets a target to be proven or contradicted, and infer marks deductions.
  • Figure 3: Final answer accuracies with guided and unguided language models on PrOntoQA and ProofWriter, with bootstrapped 95% confidence intervals.
  • Figure 4: Accuracies of models with and without LogicGuide on the Syllogism Validity task.
  • Figure 5: Accuracy of LLaMA 13B on held-out PrOntoQA problems when bootstrapping using STaR.
  • ...and 14 more figures