Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships

D. Panas, S. Seth, V. Belle

TL;DR

The paper investigates whether large language models can infer entailed arithmetical relationships from implicitly stored knowledge. It introduces Entailed Arithmetical Relationship (EAR) probing, which pairs atomic numerical facts with relational inequalities and validates outputs via a Python symbolic solver to distinguish genuine reasoning from statistical inference. Results show that newer models such as GPT-4 improve at recalling raw numerical facts but remain unstable, prompt-sensitive, and weak at entailment, often reflecting memorized patterns rather than true reasoning. The authors argue that non-neuro-symbolic LLMs function more like statistical search engines on arithmetic tasks, and that reliable arithmetic reasoning in practice calls for neuro-symbolic integrations or external solvers.

Abstract

Two major areas of interest in the era of Large Language Models concern what LLMs know, and whether and how they may be able to reason (or rather, approximately reason). Since to date these lines of work have progressed largely in parallel (with notable exceptions), we are interested in investigating the intersection: probing for reasoning about implicitly held knowledge. Suspecting the performance to be lacking in this area, we use a very simple set-up of comparisons between cardinalities associated with elements of various subjects (e.g. the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference only. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better and that chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of the production of correct answers with genuine reasoning ability.

Paper Structure (15 sections, 3 figures, 1 table)


Figures (3)

  • Figure 1: (upper-left) Data is constructed as subject-element pairs with known associated ground truth values from which many combinations of subject-element-subject-element quad-tuples can be generated, assuming 0 for missing parts. (upper-right) For each quad-tuple we can assign the ground truth relationship string, since we know the atomic facts. (lower-left) From these pairs and quad-tuples, prompts for numerical probing and entailed relationship probing are generated by populating templates. (lower-right) To evaluate Entailment, once numerical answers are given by an LLM, the relationship 'reasoned' by an LLM (orange) can be compared to an exact reasoning carried out in Python (black).
  • Figure 2: Proportion of correct answers on numerical facts, by subject. For each subject the same set of elements was probed, e.g. number of legs, tyres, fins etc.
  • Figure 3: Number of elements predicted by GPT-3.5 in a 0-shot setting with fact formulation A, by element, for three selected illustrative subjects: a unicycle, a bicycle and a tricycle. In the majority of cases, answers are biased towards the defining characteristic number of each subject; a notable exception is fingers on each hand and toes on each foot.
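
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fact table, the prompt templates, and the names `FACTS`, `count`, `relation`, `quad_tuples`, `prompts` and `evaluate` are all hypothetical. The sketch builds quad-tuples from subject-element pairs (assuming 0 for missing parts), derives the ground-truth relationship string, fills prompt templates, and then compares the relationship stated by a model with the one that follows exactly from the model's own numerical answers.

```python
from itertools import combinations, product

# Hypothetical ground truth: (subject, element) -> count. Illustrative values only.
FACTS = {
    ("bird", "legs"): 2,
    ("bicycle", "wheels"): 2,
    ("tricycle", "wheels"): 3,
}

# Hypothetical prompt templates; the paper's actual wording is not reproduced here.
FACT_TEMPLATE = "How many {element} does a {subject} have?"
RELATION_TEMPLATE = (
    "Is the number of {e1} of a {s1} <, = or > the number of {e2} of a {s2}?"
)


def count(subject, element):
    """Look up a ground-truth count, assuming 0 for missing parts."""
    return FACTS.get((subject, element), 0)


def relation(a, b):
    """Exact comparison of two counts, returned as a relationship string."""
    return "<" if a < b else ">" if a > b else "="


def quad_tuples():
    """All (subject1, element1, subject2, element2) combinations."""
    subjects = sorted({s for s, _ in FACTS})
    elements = sorted({e for _, e in FACTS})
    pairs = product(subjects, elements)
    return [(s1, e1, s2, e2) for (s1, e1), (s2, e2) in combinations(pairs, 2)]


def prompts(quad):
    """Populate the two numerical prompts and the relationship prompt."""
    s1, e1, s2, e2 = quad
    return (
        FACT_TEMPLATE.format(subject=s1, element=e1),
        FACT_TEMPLATE.format(subject=s2, element=e2),
        RELATION_TEMPLATE.format(s1=s1, e1=e1, s2=s2, e2=e2),
    )


def evaluate(quad, llm_n1, llm_n2, llm_relation):
    """Score a stated relationship against ground truth and against the
    relationship entailed by the model's own numerical answers."""
    s1, e1, s2, e2 = quad
    truth = relation(count(s1, e1), count(s2, e2))
    entailed = relation(llm_n1, llm_n2)
    return {
        "correct": llm_relation == truth,
        "entailed": llm_relation == entailed,
    }
```

For the bird-legs versus tricycle-wheels example from the abstract, a model that answers 4 and 3 for the two counts and then states ">" would be scored as incorrect but entailed: the relationship is wrong with respect to the ground truth, yet consistent with the model's own numbers. Keeping the two scores separate is what lets the evaluation tell a correct answer apart from one that actually follows from the stated facts.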