Table of Contents
Fetching ...

Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He

TL;DR

ProverGen introduces ProverQA, a challenging first-order logic reasoning benchmark generated by a three-stage pipeline that combines LLMs with a symbolic prover to ensure logical coherence. The framework yields a 1,500-instance dataset across easy, medium, and hard levels, with natural-language explanations and robust distraction mechanisms to test reasoning fidelity. Experimental results reveal that even strong LLMs struggle on hard ProverQA items, and finetuning on ProverQA data yields substantial improvements in both in-distribution and out-of-distribution settings. The work provides open-source tooling for reproducible data generation and demonstrates the potential of integrating symbolic reasoning with data-generation pipelines to advance logical reasoning evaluation and model robustness.

Abstract

First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen

Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

TL;DR

ProverGen introduces ProverQA, a challenging first-order logic reasoning benchmark generated by a three-stage pipeline that combines LLMs with a symbolic prover to ensure logical coherence. The framework yields a 1,500-instance dataset across easy, medium, and hard levels, with natural-language explanations and robust distraction mechanisms to test reasoning fidelity. Experimental results reveal that even strong LLMs struggle on hard ProverQA items, and finetuning on ProverQA data yields substantial improvements in both in-distribution and out-of-distribution settings. The work provides open-source tooling for reproducible data generation and demonstrates the potential of integrating symbolic reasoning with data-generation pipelines to advance logical reasoning evaluation and model robustness.

Abstract

First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen

Paper Structure

This paper contains 33 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: An overview of our ProverGen framework. (a) Background Story Generation (Section \ref{['subsubsec:background']}). Given a subject, Sawyer, and a seed keyword, elegant, LLMs generate a background story to establish context and ensure linguistic diversity. (b) Logic Skeleton Generation (Section \ref{['subsubsec:premise_generation']}). A top-down approach is used to generate the logic skeleton, forming the reasoning tree for the FOL problem. This involves two iterative steps: expression sampling and truth value calculation using the Prover9 prover. Distractions are also incorporated to test the robustness of model's reasoning capabilities. (c) Statement Translation (Section \ref{['subsubsec:translation']}). LLMs translate the facts and rules from the logic skeleton into natural language, guided by the previously generated background story. Each rule is translated into a universal version and a specific version. The universal version is preferred if it does not contradict with common sense, otherwise the specific one is selected.
  • Figure 2: A generated FOL problem (a) with its corresponding reasoning process (b).