Robustness of Neurosymbolic Reasoners on First-Order Logic Problems

Hannah Bansal, Kemal Kurniawan, Lea Frermann

TL;DR

The paper tackles robustness of reasoning systems to counterfactual perturbations in first-order logic tasks, comparing purely neural LLMs with neurosymbolic approaches like LINC and introducing NSCoT, which augments NS with Chain-of-Thought prompting. Across RR and the larger FOLIO dataset, LINC shows strong robustness against counterfactual changes (CF accuracy delta below 0.05), but neural models typically achieve higher overall accuracy, albeit with larger CF gaps. NSCoT improves NL→FOL translations and robustness relative to LINC, but still does not reach the performance of pure CoT-based neural methods, highlighting that NL→FOL translation remains the primary bottleneck. The findings suggest that neurosymbolic methods confer robustness benefits and that guiding NL→FOL conversion with structured reasoning can close some gaps, informing future work on translating natural language to formal representations more reliably and extending these approaches beyond FOL tasks.

Abstract

Recent trends in NLP aim to improve reasoning capabilities in Large Language Models (LLMs), with a key focus on generalization and robustness to variations in tasks. Counterfactual task variants introduce minimal but semantically meaningful changes to otherwise valid first-order logic (FOL) problem instances, such as altering a single predicate or swapping the roles of constants, to probe whether a reasoning system can maintain logical consistency under perturbation. Previous studies showed that LLMs become brittle on counterfactual variations, suggesting that they often rely on spurious surface patterns to generate responses. In this work, we explore whether a neurosymbolic (NS) approach that integrates an LLM and a symbolic logical solver could mitigate this problem. Experiments across LLMs of varying sizes show that NS methods are more robust but perform worse overall than purely neural methods. We then propose NSCoT, which combines an NS method with Chain-of-Thought (CoT) prompting, and demonstrate that while it improves performance, NSCoT still lags behind standard CoT. Our analysis opens research directions for future work.
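To make the notion of a counterfactual variant and the robustness measure concrete, the following is a minimal sketch and not the authors' procedure. It assumes a counterfactual instance can be built by swapping two terms in the text, and that robustness is read off as the difference between accuracy on default and counterfactual inputs. The helper names (`swap_terms`, `accuracy`, `robustness_gap`), the example sentence, and the labels are all hypothetical.

```python
import re


def swap_terms(text: str, a: str, b: str) -> str:
    """Swap whole-word occurrences of `a` and `b` in `text` (case-sensitive).

    Hypothetical helper: a crude stand-in for constructing a counterfactual
    variant by exchanging two key nouns or constants.
    """
    pattern = re.compile(rf"\b({re.escape(a)}|{re.escape(b)})\b")
    return pattern.sub(lambda m: b if m.group(1) == a else a, text)


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def robustness_gap(default_preds, default_gold, cf_preds, cf_gold) -> float:
    """Default accuracy minus counterfactual accuracy.

    A value near zero means the system is about as accurate on the
    counterfactual inputs as on the default ones. The counterfactual gold
    labels are passed separately because a perturbation may change the label.
    """
    return accuracy(default_preds, default_gold) - accuracy(cf_preds, cf_gold)


# Illustrative (made-up) data, not results from the paper.
premise = "All cats chase mice."
cf_premise = swap_terms(premise, "cats", "mice")

gold = ["True", "False", "Uncertain", "True"]
default_preds = ["True", "False", "Uncertain", "True"]
cf_preds = ["True", "False", "True", "True"]
gap = robustness_gap(default_preds, gold, cf_preds, gold)
```

Under this reading, a method is robust when the gap stays small, independently of how high its absolute accuracy is; this is why a method can be more robust yet less accurate overall.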

Paper Structure

This paper contains 25 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of our data and models. We test models in their ability to reason over default and counterfactual inputs, where key nouns were swapped (top). We compare fully neural models (LLMs) with neurosymbolic methods that combine LLMs with logical solvers. In our example the neural model fails on the counterfactual input but the neurosymbolic method makes correct predictions (bottom), suggesting higher robustness. Example taken from Wu et al. (2024).
  • Figure 2: Comparison of the few-shot prompts in LINC (left) and NSCoT (right). In contrast to LINC, for NSCoT we pass examples that include reasoning chains between the language input and FOL translations; and instruct the model to produce a reasoning chain during generation. After this step, we pass in the generated FOLs to Prover9 for both models.
  • Figure 3: Accuracy of LINC (blue) and NSCoT (green) on inputs with different numbers of premises (2 to 8) on the full FOLIO data. The presented results are averaged over all LLMs (as listed in the methods accuracy table, tab:methods_accuracy). LINC suffers a sharper decline in performance than NSCoT.
  • Figure 4: Confusion matrices for the predicted vs gold labels on the CF (left) vs Default (right) versions of RR for LINC (top) and Naïve (bottom). Predicted and ground truth labels are on the x- and y-axis respectively. The underlying LLM is Qwen2.5 7B.
  • Figure 5: Confusion matrices comparing LINC, NSCoT, CoT and Naïve for the FOLIO validation set. Predicted and ground truth labels are on the x- and y-axis respectively. The underlying LLM is Qwen2.5 7B.
  • ...and 1 more figure