Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

Luoxin Chen, Yichi Zhou, Huishuai Zhang

Abstract

Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, outcome rewards often overlook flawed intermediate steps, producing unreliable reasoning even when the final answer is correct. To address this, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely achievable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover, and only fully validated reasoning chains receive a high reward. The integration of formal verification guides the model toward generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.
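To make the reward concrete, the following is a minimal sketch of the verification-gated reward described above. All names here (ReasoningStep, formal_prover_verifies, prosfi_reward) are illustrative assumptions for exposition, not the authors' implementation or an actual prover interface.

```python
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    """One structured formal intermediate aligned with a natural-language step."""
    premises: list[str]   # formal propositions the step depends on, e.g. ["h1", "h2"]
    conclusion: str       # the proposition this step derives, e.g. "h5"
    rule: str             # inference rule applied, e.g. "ModusPonens"

def formal_prover_verifies(step: ReasoningStep) -> bool:
    """Placeholder for a call to an external formal prover (assumed here)."""
    raise NotImplementedError("wire up a real prover backend")

def prosfi_reward(steps: list[ReasoningStep], answer: str, ground_truth: str) -> float:
    # An outcome-only reward would return 1.0 whenever the answers match.
    # PRoSFI additionally requires every intermediate step to pass formal
    # verification: a single unverifiable step yields a low reward, even
    # when the final answer happens to be correct (cf. Figure 1).
    if answer != ground_truth:
        return 0.0
    if all(formal_prover_verifies(step) for step in steps):
        return 1.0
    return 0.0  # correct answer, but the formal proof fails verification
```

Gating the outcome reward on step-level verification is what lets the training signal distinguish sound reasoning chains from lucky guesses.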

Figures (4)

  • Figure 1: Pipeline for Process Reward over Structured Formal Intermediates and comparison with Outcome Reward. The process begins with a problem statement (left block) containing multiple propositions (h1–h4) about two individuals (Hattie and Candy). A large language model (LLM) then generates natural language reasoning steps to solve the problem (middle block). The Outcome Reward approach directly compares the generated answer with the ground truth, assigning a reward of 1 if they match. In contrast, PRoSFI introduces an additional step that generates structured formal intermediates capturing logical dependencies, inferred conclusions, and the application of formal inference rules such as Hypothetical Syllogism and Modus Ponens; a hypothetical encoding of such intermediates is sketched after this list. These formal representations are then submitted to a Formal Prover, which verifies the logical soundness of the reasoning. In the illustrated example, although the generated answer ("Hattie does not gain community respect" is false) is correct, the formal proof fails verification, so PRoSFI assigns a low reward despite the correct outcome.
  • Figure 2: Left: GPT Soundness on the ProverQA-hard subset. Right: GPT Soundness on the ProverQA-extra subset. As the number of sampled paths increases, PRoSFI with PRoSFI selection improves consistently, while Outcome-CoT cannot benefit due to its translation limits.
  • Figure 3: Left: Metric correlation heatmap on the ProverQA-hard subset. Right: Metric correlation heatmap on the ProverQA-extra subset. Compared to Answer Correctness, PRoSFI Reward Hit better reflects reasoning-path soundness.
  • Figure 4: Left: Confusion matrix of answer correctness vs. reasoning soundness for Outcome-CoT on the ProverQA-hard subset. Right: Corresponding results on the ProverQA-extra subset. The lack of alignment between correct answers and sound reasoning indicates that answer accuracy alone is not a sufficient criterion for judging the quality of reasoning.
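For concreteness, a structured formal intermediate of the kind Figure 1 depicts might be encoded as below. The schema, field names, and proposition labels are illustrative assumptions for this sketch; the paper's actual format may differ.

```python
# Hypothetical encoding of two structured formal intermediates for the
# Figure 1 example; field names and proposition labels are illustrative.
steps = [
    {
        "depends_on": ["h1", "h2"],       # propositions from the problem statement
        "rule": "HypotheticalSyllogism",  # formal inference rule applied
        "derives": "h5",                  # newly inferred conclusion
    },
    {
        "depends_on": ["h5", "h3"],
        "rule": "ModusPonens",
        "derives": "h6",
    },
]
# Each step is submitted to the formal prover, and the chain is rewarded only
# if every step verifies; a correct answer backed by an unverifiable chain
# (as in Figure 1) still receives a low reward.
```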