Table of Contents
Fetching ...

PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints

Tangrui Li, Pei Wang, Hongzheng Wang Christian Hahm, Matteo Spatola, Justin Shi

TL;DR

PCRLLM addresses trustworthiness and logical coherence in large language models by enforcing single-step inferences and producing proof-carrying outputs that encode premises, rules, and conclusions for automatic verification. By grounding reasoning in Non-Axiomatic Logic (NAL) and maintaining natural-language problem descriptions, the method enables chain-level validation and systematic multi-LLM collaboration through formal rule integration. The authors introduce a large-scale, step-level reasoning benchmark and data-generation schema to address formal-reasoning data scarcity. Experimental results on variants of the Qwen 2.5 family show improvements in formal conformity through priming and stepwise recomposition, while highlighting challenges in formatting, scalability, and premise attribution that shape future work.

Abstract

Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.

PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints

TL;DR

PCRLLM addresses trustworthiness and logical coherence in large language models by enforcing single-step inferences and producing proof-carrying outputs that encode premises, rules, and conclusions for automatic verification. By grounding reasoning in Non-Axiomatic Logic (NAL) and maintaining natural-language problem descriptions, the method enables chain-level validation and systematic multi-LLM collaboration through formal rule integration. The authors introduce a large-scale, step-level reasoning benchmark and data-generation schema to address formal-reasoning data scarcity. Experimental results on variants of the Qwen 2.5 family show improvements in formal conformity through priming and stepwise recomposition, while highlighting challenges in formatting, scalability, and premise attribution that shape future work.

Abstract

Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.

Paper Structure

This paper contains 26 sections, 7 figures.

Figures (7)

  • Figure 1: First-order NAL Syllogism.
  • Figure 2: The JSON structure overview of the expected LLM response.
  • Figure 3: The overview of the prompt structure.
  • Figure 4: Ratio of valid answers under five priming strategies. Model: Qwen2.5--0.5B; fine-tuned for 1 epoch. Training set: 100 instances (seed = 39). Dashed curves denote JSON-repaired results.
  • Figure 5: Ratio of valid answers under five priming strategies. Model: Qwen2.5--1.5B; fine-tuned for 2 epochs. Training set: 100 instances (seed = 39). Dashed curves denote JSON-repaired results.
  • ...and 2 more figures