Reliable Reasoning Beyond Natural Language

Nasim Borazjanizadeh; Steven T. Piantadosi

TL;DR

LLM reasoning is brittle, which the paper attributes to sequential next-token prediction and the linear nature of natural language. To expose these limits, the authors introduce the Non-Linear Reasoning (NLR) dataset, and to address them they propose a neurosymbolic pipeline that offloads iterative deduction to Prolog: the LLM translates the problem's information into Prolog code, and a Multiple-Try inference loop re-prompts it when the generated program fails. The approach yields substantial gains on GSM8k and Navigate and near-perfect accuracy on NLR, and it stays robust as variable interdependence increases. The results indicate that pairing neural language understanding with symbolic reasoning supports reliable non-linear deduction and backtracking that text-only prompting does not achieve, which matters for deploying AI systems in domains that need robust, interpretable reasoning over complex relational structures.
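
The Multiple-Try inference loop can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `generate_code` and `run_prolog` are hypothetical wrappers around an LLM call and a Prolog interpreter, the default limit of 3 attempts is arbitrary, and the stubs below stand in for both components so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TryResult:
    answer: Optional[str]  # None if no attempt produced a valid program
    attempts: int          # number of code-generation calls made


def multiple_try(
    problem: str,
    generate_code: Callable[[str], str],         # hypothetical LLM wrapper
    run_prolog: Callable[[str], Optional[str]],  # hypothetical interpreter wrapper
    max_attempts: int = 3,                       # assumed limit, not from the paper
) -> TryResult:
    """Re-prompt until the generated program runs or the limit is reached."""
    for attempt in range(1, max_attempts + 1):
        code = generate_code(problem)
        answer = run_prolog(code)  # None signals a failed or invalid program
        if answer is not None:
            return TryResult(answer, attempt)
    return TryResult(None, max_attempts)


def make_stub_llm() -> Callable[[str], str]:
    # Stand-in for an LLM: the first draft is truncated, the second is complete.
    drafts = iter(["answer(X) :- X is 3 +", "answer(X) :- X is 3 + 5."])
    return lambda problem: next(drafts)


def stub_prolog(code: str) -> Optional[str]:
    # Stand-in for Prolog: treat only clauses that end in a full stop as valid.
    return "8" if code.rstrip().endswith(".") else None


result = multiple_try("What is 3 + 5?", make_stub_llm(), stub_prolog)
print(result)  # TryResult(answer='8', attempts=2)
```

Passing the two components in as callables keeps the retry logic separate from any particular model or Prolog binding, so the same loop can be exercised with stubs, as here, or with real back ends.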

Abstract

Despite their linguistic competence, Large Language Models (LLMs) often struggle to reason reliably and flexibly. To identify these shortcomings, we introduce the Non-Linear Reasoning (NLR) dataset, a collection of 55 unique, hand-designed problems that target reasoning bottlenecks arising from the sequential prediction paradigm of LLMs and the inherently linear nature of natural language. NLR tasks require iterative updates, backtracking, and reasoning across multiple parallel chains of thought but only basic arithmetic to solve. To address these limitations, we propose a neurosymbolic reasoning approach that integrates Prolog, a symbolic reasoning engine, into the inference pipeline of LLMs. This division of labor shifts the LLM's task from iterative computations to inferring all information, explicit or implied through common sense, and encoding it as logical code. Our method yields large and robust performance gains across the GSM8k and BIG-bench Navigate benchmarks and achieves near-perfect accuracy on NLR problems, maintaining robustness even as variable interdependence (the number of other variables on which the value of a single variable depends) increases.
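
To make the notion of variable interdependence concrete, the sketch below counts, for each variable, how many other variables its value depends on. The toy problem (Bob's count depends on Alice's, and Carol's on both) is invented for illustration and is not taken from NLR. Two further assumptions are ours: only direct dependencies are counted, and a problem's degree is taken to be the largest count among its variables.

```python
from typing import Dict, Set


def interdependence(deps: Dict[str, Set[str]]) -> Dict[str, int]:
    """Map each variable to the number of *other* variables it depends on."""
    # Discard self-references so a variable never counts itself.
    return {var: len(others - {var}) for var, others in deps.items()}


# Hypothetical toy problem: bob is defined from alice, carol from alice and bob.
deps = {
    "alice": set(),
    "bob": {"alice"},
    "carol": {"alice", "bob"},
}

per_variable = interdependence(deps)
problem_degree = max(per_variable.values())  # assumed reading: take the maximum

print(per_variable)    # {'alice': 0, 'bob': 1, 'carol': 2}
print(problem_degree)  # 2
```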

Paper Structure (10 sections, 5 figures, 2 tables)

This paper contains 10 sections, 5 figures, 2 tables.

Introduction
NLR Dataset
Our Neurosymbolic Approach
Other Works
Experiments & Results
GSM8k
Navigate Dataset
NLR Dataset
Limitations
Conclusion

Figures (5)

Figure 1: Our neurosymbolic approach: A natural language problem (for example, a math word problem from the NLR dataset) is given to an LLM, which is prompted to perform chain-of-thought (CoT) reasoning in text and logical code, encoding the variable relationships as logical code statements. The Prolog interpreter then executes the code. If the Prolog program fails, the LLM is re-prompted until valid code is generated or the attempt limit is reached.
Figure 2: Comparing single-model accuracy on GSM8k and Navigate benchmarks using text-only CoT versus our neurosymbolic approach (GPT-3.5 + Prolog, GPT-4 + Prolog), using CoT in text and logical code and the Multiple-Try inference algorithm. Few-shot CoT-in-text baselines on GSM8k are reported by Bubeck et al. (2023) and OpenAI (2023).
Figure 3: Comparing single-model accuracy of LLMs (GPT-3.5, GPT-4) on the NLR dataset when prompted with text-only CoT versus our neurosymbolic approach (GPT-3.5 + Prolog, GPT-4 + Prolog), using CoT in text and logical code and the Multiple-Try inference algorithm.
Figure 4: Comparing single-model accuracy of GPT-4 using a text-only CoT prompt versus our neurosymbolic approach on a subset of NLR problems with 0–4 degrees of variable interdependence. “$k$ degree of interdependence” means at least one variable's value depends on $k$ other variables.
Figure 5: Evaluating the robustness and performance variability of our best model, GPT-4 + Prolog, by running it 25 times on each NLR problem and recording the accuracy and average number of attempts it took to generate valid code using the Multiple-Try inference algorithm.
