Table of Contents
Fetching ...

Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies

Lachlan McGinness, Peter Baumgartner

TL;DR

This work probes whether state-of-the-art LLMs can effectively follow Automated Theorem Prover (ATP) reasoning strategies on steamroller problems using the PRONTOQA framework. By prompting models with bottom-up, top-down, and magic-set inspired prompts, the authors assess both answer accuracy and the fidelity of the reasoning process, supplemented by spaCy-based natural language analysis and uncertainty estimates. A key finding is the weak link between correct reasoning steps and correct answers, with bottom-up reasoning preferred by LLMs and ATP-inspired prompts offering only modest gains at higher computational cost. The results motivate hybrid neuro-symbolic approaches that couple LLMs with external inference engines to improve explainability and reliability in deductive tasks.

Abstract

This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLM's reasoning capabilities. This led to one alarming result, the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought and observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation we confirm that LLMs have a preference for, and are best able to follow, bottom up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.

Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies

TL;DR

This work probes whether state-of-the-art LLMs can effectively follow Automated Theorem Prover (ATP) reasoning strategies on steamroller problems using the PRONTOQA framework. By prompting models with bottom-up, top-down, and magic-set inspired prompts, the authors assess both answer accuracy and the fidelity of the reasoning process, supplemented by spaCy-based natural language analysis and uncertainty estimates. A key finding is the weak link between correct reasoning steps and correct answers, with bottom-up reasoning preferred by LLMs and ATP-inspired prompts offering only modest gains at higher computational cost. The results motivate hybrid neuro-symbolic approaches that couple LLMs with external inference engines to improve explainability and reliability in deductive tasks.

Abstract

This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLM's reasoning capabilities. This led to one alarming result, the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought and observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation we confirm that LLMs have a preference for, and are best able to follow, bottom up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.
Paper Structure (22 sections, 5 figures, 5 tables)

This paper contains 22 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Diagram of analysis pipeline. This shows all of the quantities that were investigated in this study. The Answer Correctness blue box corresponds to the traditional analysis seen in most studies using benchmarks. Process Hard Correctness evaluates qualities of the model's reasoning process using a Boolean framework. Process Soft Correctness evaluates the model's reasoning by considering the extent of completeness and conciseness of responses.
  • Figure 2: Illustration of Reasoning Strategies. Facts are shown in red, rules in yellow and queries in green. For simplicity only very simple chains of reasoning with few or no distractors are shown. Magic Set Transformation has been used in Automated Theorem Proving and related disciplines for search space pruning.
  • Figure 3: Presence of Reasoning. The top graph shows the percentage of cases where each model showed some form of reasoning to solve the problem. The bottom graph shows the percentage of cases where the models displayed all of the steps that an ATP would go through to solve the problem.
  • Figure 4: Ratio of Process Following. This graph shows the ratio of times the model followed the instructed reasoning process out of all attempts where each model showed all required reasoning steps.
  • Figure 5: Correlation Matrices. These matrices shows the correlation between whether the model included all steps (AllSteps), included the steps in the correct order (CorrectOrder) and obtain the correct answer (CorrectAnswer).