Table of Contents
Fetching ...

When Do Symbolic Solvers Enhance Reasoning in Large Language Models?

Zhiyuan He, Dingmin Wang

TL;DR

This work investigates when integrating symbolic solvers with large language models is advantageous, given the overhead and potential errors of long Chain-of-Thought reasoning. By prompting LLMs to generate declarative Prolog or Python code that encodes problems and is solved by symbolic engines, the authors compare CoT with solver-based methods across GSM8K variants, ZebraLogic CSPs, and EntailmentBank, using GPT-4o and CodeLlama-13B. They find that symbolic solvers excel in constraint satisfaction and backtracking-heavy tasks, particularly with one-shot declarative exemplars, while CoT remains strong for shallow deductive reasoning. The results provide practical guidance on when to deploy symbolic-solvers in LLM-based reasoning, highlighting the value of declarative prompting to improve correctness and efficiency in solver-based workflows.

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs). However, this paradigm might incur substantial token overhead, especially when models "overthink" by producing lengthy reasoning chains, which can even lead to incorrect answers. A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver. In this paper, we explore an open question of when the conventional long-CoT can be enhanced by symbolic solvers. Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space. The latest LLMs, like GPT-4o, show better performance on deductive problems with shallow reasoning depth, while the symbolic-solver-integrated method significantly improves the LLMs' performance in constraint satisfaction problems that require repeated backtracks. When a declarative exemplar is provided, even CodeLlama-13B can outperform GPT-4o in difficult Zebra puzzles.

When Do Symbolic Solvers Enhance Reasoning in Large Language Models?

TL;DR

This work investigates when integrating symbolic solvers with large language models is advantageous, given the overhead and potential errors of long Chain-of-Thought reasoning. By prompting LLMs to generate declarative Prolog or Python code that encodes problems and is solved by symbolic engines, the authors compare CoT with solver-based methods across GSM8K variants, ZebraLogic CSPs, and EntailmentBank, using GPT-4o and CodeLlama-13B. They find that symbolic solvers excel in constraint satisfaction and backtracking-heavy tasks, particularly with one-shot declarative exemplars, while CoT remains strong for shallow deductive reasoning. The results provide practical guidance on when to deploy symbolic-solvers in LLM-based reasoning, highlighting the value of declarative prompting to improve correctness and efficiency in solver-based workflows.

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs). However, this paradigm might incur substantial token overhead, especially when models "overthink" by producing lengthy reasoning chains, which can even lead to incorrect answers. A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver. In this paper, we explore an open question of when the conventional long-CoT can be enhanced by symbolic solvers. Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space. The latest LLMs, like GPT-4o, show better performance on deductive problems with shallow reasoning depth, while the symbolic-solver-integrated method significantly improves the LLMs' performance in constraint satisfaction problems that require repeated backtracks. When a declarative exemplar is provided, even CodeLlama-13B can outperform GPT-4o in difficult Zebra puzzles.

Paper Structure

This paper contains 30 sections, 23 figures, 6 tables.

Figures (23)

  • Figure 1: One-shot prompt for GSM-Hard using a declarative sentence-to-sentence translation exemplar of Python code.
  • Figure 2: Examples of LLMs failing to reason about implicit information in code generation
  • Figure 3: Investigate the distribution of the number of retries needed to recover from syntax errors of the generated Prolog and Python code. We use GSM-Hard dataset, which contains 1319 problems in total, as an example.
  • Figure 4: GPT-4o with Python on GSM-Hard
  • Figure 5: Prompt with a declarative Prolog exemplar for GSM-Hard
  • ...and 18 more figures