Table of Contents
Fetching ...

Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning

Alexander Beiser, David Penz, Nysret Musliu

TL;DR

This paper investigates how the choice of the formal intermediate language influences neurosymbolic LLM reasoning. By comparing four formal languages (Pyke, ASP, NLTK, FOL) across three datasets (ProntoQA, ProofWriter, FOLIO) and six LLMs, the authors quantify the impact of syntax and encoding strategies via an extensive ablation study. They find that First-Order Logic (FOL) typically yields better reasoning performance than logic-programming languages, and that context-aware encodings improve accuracy, while comments or Markdown syntax offer no consistent benefit. The study formalizes the intermediate language challenge and demonstrates that the right formal language, together with thoughtful ICL-encoding, can significantly enhance neurosymbolic reasoning, with practical implications for deploying such systems in resource-constrained settings and guiding future research into additional languages and prompting strategies.

Abstract

Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, it remains unclear what the contributing factors to the success of Neurosymbolic LLM reasoning are. This paper shows that one important factor is the choice of the formal language. By comparing 4 formal languages on 3 datasets over 6 LLMs, we show that the choice of formal language affects both the syntactic and the semantic reasoning capability. Thereby, we introduce the intermediate language challenge, which is the challenge of picking a suitable formal language for neurosymbolic reasoning. Further, we compare the effects of using different in-context-learning examples in an ablation study. We conclude that on average, context-aware encodings help LLMs to reason, while there is no apparent effect of using comments or markdown syntax.

Intermediate Languages Matter: Formal Choice Drives Neurosymbolic LLM Reasoning

TL;DR

This paper investigates how the choice of the formal intermediate language influences neurosymbolic LLM reasoning. By comparing four formal languages (Pyke, ASP, NLTK, FOL) across three datasets (ProntoQA, ProofWriter, FOLIO) and six LLMs, the authors quantify the impact of syntax and encoding strategies via an extensive ablation study. They find that First-Order Logic (FOL) typically yields better reasoning performance than logic-programming languages, and that context-aware encodings improve accuracy, while comments or Markdown syntax offer no consistent benefit. The study formalizes the intermediate language challenge and demonstrates that the right formal language, together with thoughtful ICL-encoding, can significantly enhance neurosymbolic reasoning, with practical implications for deploying such systems in resource-constrained settings and guiding future research into additional languages and prompting strategies.

Abstract

Large language models (LLMs) achieve astonishing results on a wide range of tasks. However, their formal reasoning ability still lags behind. A promising approach is Neurosymbolic LLM reasoning. It works by using LLMs as translators from natural to formal languages and symbolic solvers for deriving correct results. Still, it remains unclear what the contributing factors to the success of Neurosymbolic LLM reasoning are. This paper shows that one important factor is the choice of the formal language. By comparing 4 formal languages on 3 datasets over 6 LLMs, we show that the choice of formal language affects both the syntactic and the semantic reasoning capability. Thereby, we introduce the intermediate language challenge, which is the challenge of picking a suitable formal language for neurosymbolic reasoning. Further, we compare the effects of using different in-context-learning examples in an ablation study. We conclude that on average, context-aware encodings help LLMs to reason, while there is no apparent effect of using comments or markdown syntax.

Paper Structure

This paper contains 33 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Neurosymbolic LLM reasoning: A problem formulated in natural language is translated by using in-context-learning into a formal language. Subsequently, a symbolic reasoner subsequently computes a solution to the problem, which is followed by the re-translation of the solution.
  • Figure 2: Parallel coordinates plot showing maximum overall-accuracy for the ProntoQA (left) and ProofWriter (right) datasets for all LLMs. Maximum is computed w.r.t. all ablation study scenarios per formal language.
  • Figure 3: Ablation study design about ICL-example encodings on 3 axes. Comparing context with No-C. and Text, comments with / and Comm., and markdown with / and MD. Examples shown in ASP syntax.
  • Figure 4: We show the effects of the ablation study (left), averaged across all formal languages, LLMs, and datasets and the effects of the formal languages (right), averaged across all ablation study scenarios, LLMs, and the ProntoQA and ProofWriter datasets. Error bars show the SEM, where $n=200$ (left) and $n=80$ (right).
  • Figure 5: Scatter plots comparing execution-rate to execution-accuracy for the $8$ ablation study scenarios (left) and the formal languages (right). Both are averaged across ProntoQA and ProofWriter datasets and all LLMs ($n=10$). Contour lines show overall-accuracy in steps of $10\%$.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2