Towards Learning to Reason: Comparing LLMs with Neuro-Symbolic on Arithmetic Relations in Abstract Reasoning

Michael Hersche, Giacomo Camposampiero, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi

TL;DR

The paper evaluates abstract reasoning in Raven's Progressive Matrices by contrasting GPT-4 and Llama-3 70B with a neuro-symbolic approach called ARLC that uses vector-symbolic architectures. By providing oracle perception to isolate reasoning, the study reveals that LLMs struggle to execute arithmetic rules, especially as task size and value range grow. ARLC, which encodes attributes with distributed, similarity-preserving vectors and learns rules as differentiable assignments, achieves near-perfect accuracy on the center I-RAVEN constellation and generalizes to the larger I-RAVEN-X benchmark with minimal retraining. Overall, the results highlight strong arithmetic reasoning gains from neuro-symbolic VSAs and point to integrating symbolic solvers with LLMs to improve visual–abstract reasoning capabilities in RPM-like tasks.

Abstract

This work compares large language models (LLMs) and neuro-symbolic approaches in solving Raven's progressive matrices (RPM), a visual abstract reasoning test that involves the understanding of mathematical rules such as progression or arithmetic addition. Providing the visual attributes directly as textual prompts, which assumes an oracle visual perception module, allows us to measure the model's abstract reasoning capability in isolation. Despite such compositionally structured representations from the oracle visual perception and advanced prompting techniques, neither GPT-4 nor Llama-3 70B achieves perfect accuracy on the center constellation of the I-RAVEN dataset. Our analysis reveals that the root cause lies in the LLMs' weakness in understanding and executing arithmetic rules. As a potential remedy, we analyze the Abductive Rule Learner with Context-awareness (ARLC), a neuro-symbolic approach that learns to reason with vector-symbolic architectures (VSAs). Here, concepts are represented with distributed vectors such that dot products between encoded vectors define a similarity kernel, and simple element-wise operations on the vectors perform addition/subtraction on the encoded values. We find that ARLC achieves almost perfect accuracy on the center constellation of I-RAVEN, demonstrating high fidelity in arithmetic rules. To stress the length generalization capabilities of the models, we extend the RPM tests to larger matrices (3x10 instead of the typical 3x3) and larger dynamic ranges of the attribute values (from 10 up to 1000). We find that the LLMs' accuracy in solving arithmetic rules drops below 10%, especially as the dynamic range expands, while ARLC maintains a high accuracy because it emulates symbolic computations on top of properly distributed representations. Our code is available at https://github.com/IBM/raven-large-language-models.
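The VSA property described in the abstract (dot products act as a similarity kernel, element-wise operations add or subtract the encoded values) can be illustrated with a small sketch. This is not the paper's implementation: it assumes a complex phasor variant of fractional power encoding with random phases, and the function names (`make_base`, `encode`, `similarity`, `decode`) and the dimension of 1024 are hypothetical choices made for the example.

```python
import numpy as np

def make_base(dim=1024, seed=0):
    """Random phases, one per dimension (assumed: uniform in [-pi, pi))."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-np.pi, np.pi, size=dim)

def encode(value, phases):
    """Fractional power encoding: raise the base phasor vector to `value`."""
    return np.exp(1j * phases * value)

def similarity(a, b):
    """Normalized dot product; depends only on the difference of the encoded values."""
    return float(np.real(np.vdot(b, a)) / a.size)

def decode(vec, phases, max_value):
    """Return the integer in [0, max_value] whose encoding is most similar to `vec`."""
    candidates = np.arange(max_value + 1)
    codebook = np.exp(1j * np.outer(candidates, phases))
    sims = np.real(codebook.conj() @ vec) / vec.size
    return int(candidates[np.argmax(sims)])

phases = make_base()
three, four, seven = (encode(v, phases) for v in (3, 4, 7))

# Element-wise multiplication adds the encoded values.
added = three * four
# Multiplying by the complex conjugate subtracts them.
subtracted = seven * np.conj(three)

print(decode(added, phases, max_value=20))       # decodes the sum 3 + 4
print(decode(subtracted, phases, max_value=20))  # decodes the difference 7 - 3
```

In this sketch the similarity between two encodings is the mean of cos(θ·(v1 − v2)) over the random phases θ, so it is 1 for equal values and close to 0 for distinct integers, which is why a nearest-neighbor lookup in the codebook recovers the result of the arithmetic.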

Paper Structure

This paper contains 28 sections, 10 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: This work compares the abstract reasoning capabilities of LLMs and the neuro-symbolic ARLC on Raven's progressive matrices (RPM) tests. a) An RPM example taken from the center constellation of I-RAVEN. The task is to fill the empty panel at the bottom-right of the context matrix by selecting one of the answer candidates. b) Solving RPMs through LLM prompting. Visual attribute values are extracted from the I-RAVEN dataset and assembled into individual per-attribute text-only prompts. The LLMs are prompted to predict the attribute of the empty panel. Finally, the attribute predictions are compared with the answer candidates, and the best-matching candidate is selected as the final answer. c) Solving RPMs with the neuro-symbolic ARLC, which relies on distributed similarity-preserving representations and manipulates them via dimensionality-preserving operations; it learns rule formulations as a differentiable assignment problem.
  • Figure 2: a) Individual per-attribute text-only prompts to solve RPM tasks from I-RAVEN. b) Example prompts from our novel configurable I-RAVEN-X dataset of size 3$\times$10 with a value range of $m=1000$. In both the I-RAVEN and I-RAVEN-X examples, the LLM (GPT-4) errs in the arithmetic rules.
  • Figure 3: ARLC architecture. ARLC maps attribute values, or distributions of values, to distributed VSA representations, where the semantic similarity between values is preserved via a similarity kernel. Learnable rules ($r_1, ..., r_R$) predict the VSA representation of the empty panel ($\hat{\mathbf{v}}_{a, r}^{(3,3)}$) together with a confidence value ($s_r$). The answer closest to the soft-selected prediction ($\hat{\mathbf{v}}_a^{(3,3)}$) is chosen as the final answer.
  • Figure 4: Similarity kernel in VSA. Mapping two values ($v_1$ and $v_2$) to a VSA space (i.e., GSBC in ARLC) that uses fractional power encoding (FPE) and computing their similarity in the VSA space yields the shown similarity kernel $K(v_1-v_2)$.
  • Figure 5: Visualization of current samples ($X=\{x_1,x_2\}$, in yellow) and context ($O=\{o_1,\dots,o_5\}$, in green) panels when predicting the third panel for different rows, namely the first row (left), second row (center) and third row (right). Black objects represent panels that are not used for the computation, while the question mark represents the unknown test panel, which is unavailable during inference.
  • ...and 3 more figures
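
The final step of the LLM pipeline in Figure 1b (comparing per-attribute predictions with the answer candidates and picking the best match) could look roughly like the sketch below. The caption does not specify the matching criterion, so counting exactly matching attributes is an assumption, as are the function name `select_answer`, the attribute names, and the tie-breaking rule (first candidate wins).

```python
def select_answer(predicted, candidates):
    """Return the index of the candidate that agrees with the most predicted attributes.

    predicted:  dict mapping attribute name -> predicted value
    candidates: list of dicts with the same attribute names
    Ties are broken in favor of the earliest candidate (an assumption).
    """
    scores = [
        sum(candidate.get(attr) == value for attr, value in predicted.items())
        for candidate in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)

# Hypothetical per-attribute predictions for the empty panel.
predicted = {"type": 3, "size": 2, "color": 7}
candidates = [
    {"type": 3, "size": 1, "color": 5},
    {"type": 4, "size": 2, "color": 5},
    {"type": 3, "size": 2, "color": 7},
    {"type": 3, "size": 2, "color": 6},
]
print(select_answer(predicted, candidates))  # index of the candidate matching all three attributes
```

Because each attribute is predicted from its own prompt, a single wrong attribute prediction does not necessarily change the selected answer, as long as another candidate does not match more attributes.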

Theorems & Definitions (1)

  • Definition 1: VSA