Table of Contents
Fetching ...

A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters

Long Hei Matthew Lam, Ramya Keerthy Thatikonda, Ehsan Shareghi

TL;DR

This paper investigates how the choice of symbolic solver influences the performance of LLM-augmented logical reasoning. By comparing Z3, Prover9, and Pyke across three deductive benchmarks and multiple LLMs, it demonstrates that tool choice can drive up to a $0.5$-level variation in performance and reveals a near-linear alignment between translation executable rate and accuracy for Prover9 ($r=0.98$), with distinct patterns for Z3 ($r=0.82$) and Pyke ($r=0.94$). The study finds Prover9 often offers the best translation compatibility and overall success, while Pyke struggles on more complex datasets like FOLIO; Z3 provides strong error feedback and robust performance on some tasks. The findings underscore that the integration of LLMs with symbolic solvers is highly tool-dependent, with important implications for designing robust, transparent, and scalable logic-enabled AI systems. They also highlight dataset- and domain-specific shifts (natural vs. fictional language) and the trade-offs between free-form reasoning and tool-based grounding.

Abstract

The emergence of Large Language Models (LLMs) has demonstrated promising progress in solving logical reasoning tasks effectively. Several recent approaches have proposed to change the role of the LLM from the reasoner into a translator between natural language statements and symbolic representations which are then sent to external symbolic solvers to resolve. This paradigm has established the current state-of-the-art result in logical reasoning (i.e., deductive reasoning). However, it remains unclear whether the variance in performance of these approaches stems from the methodologies employed or the specific symbolic solvers utilized. There is a lack of consistent comparison between symbolic solvers and how they influence the overall reported performance. This is important, as each symbolic solver also has its own input symbolic language, presenting varying degrees of challenge in the translation process. To address this gap, we perform experiments on 3 deductive reasoning benchmarks with LLMs augmented with widely used symbolic solvers: Z3, Pyke, and Prover9. The tool-executable rates of symbolic translation generated by different LLMs exhibit a near 50% performance variation. This highlights a significant difference in performance rooted in very basic choices of tools. The almost linear correlation between the executable rate of translations and the accuracy of the outcomes from Prover9 highlight a strong alignment between LLMs ability to translate into Prover9 symbolic language, and the correctness of those translations.

A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters

TL;DR

This paper investigates how the choice of symbolic solver influences the performance of LLM-augmented logical reasoning. By comparing Z3, Prover9, and Pyke across three deductive benchmarks and multiple LLMs, it demonstrates that tool choice can drive up to a -level variation in performance and reveals a near-linear alignment between translation executable rate and accuracy for Prover9 (), with distinct patterns for Z3 () and Pyke (). The study finds Prover9 often offers the best translation compatibility and overall success, while Pyke struggles on more complex datasets like FOLIO; Z3 provides strong error feedback and robust performance on some tasks. The findings underscore that the integration of LLMs with symbolic solvers is highly tool-dependent, with important implications for designing robust, transparent, and scalable logic-enabled AI systems. They also highlight dataset- and domain-specific shifts (natural vs. fictional language) and the trade-offs between free-form reasoning and tool-based grounding.

Abstract

The emergence of Large Language Models (LLMs) has demonstrated promising progress in solving logical reasoning tasks effectively. Several recent approaches have proposed to change the role of the LLM from the reasoner into a translator between natural language statements and symbolic representations which are then sent to external symbolic solvers to resolve. This paradigm has established the current state-of-the-art result in logical reasoning (i.e., deductive reasoning). However, it remains unclear whether the variance in performance of these approaches stems from the methodologies employed or the specific symbolic solvers utilized. There is a lack of consistent comparison between symbolic solvers and how they influence the overall reported performance. This is important, as each symbolic solver also has its own input symbolic language, presenting varying degrees of challenge in the translation process. To address this gap, we perform experiments on 3 deductive reasoning benchmarks with LLMs augmented with widely used symbolic solvers: Z3, Pyke, and Prover9. The tool-executable rates of symbolic translation generated by different LLMs exhibit a near 50% performance variation. This highlights a significant difference in performance rooted in very basic choices of tools. The almost linear correlation between the executable rate of translations and the accuracy of the outcomes from Prover9 highlight a strong alignment between LLMs ability to translate into Prover9 symbolic language, and the correctness of those translations.
Paper Structure (30 sections, 3 figures, 4 tables)

This paper contains 30 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of syntax used for different Theorem Provers: Z3 and Prover9 adhere to the traditional first-order logic (FOL) format, while Pyke adopts a simplified formula approach, distinguishing premises into rules and facts
  • Figure 2: Executable Rate for different LLM-Tool combinations, for depth 2, 3, 5 of the ProofWriter Open World Assumption (OWA). Similar trend exists for the Close World Assumption (CWA).
  • Figure 3: The proportion of various executable and non-executable instances per each tool for GPT4o. Note, Pyke does not include FOLIO (hence 1400 instances compared to Z3 and Prover 9). The Exec w/ CorrectO, and Exec w/ IncorrectO denote Executable translations that lead to correct, and incorrect outputs once executed by the tool. The Non-exec (Parse) or (Runtime) denote the non-executable translations which are either due to parsing error or other potential runtime issues.