Investigating Symbolic Capabilities of Large Language Models

Neisarg Dave, Daniel Kifer, C. Lee Giles, Ankur Mali

TL;DR

The paper investigates symbol-based reasoning in large language models by evaluating eight LLMs on symbol-intensive tasks anchored to Chomsky's Hierarchy, using minimally explained, zero-shot Chain-of-Thought prompts. It contrasts machine-encoding with knowledge-tuple encoding to assess encoding efficiency and uses five symbolic tasks (e.g., addition, multiplication, counting) across varying input sizes. The key finding is that performance deteriorates with increasing symbolic complexity, with fine-tuning offering limited gains, implying models rely on memorized input-output patterns rather than learned symbol manipulation. The work highlights the need for memory mechanisms and architectural changes to enable robust symbol-based reasoning in LLMs and motivates developing automata-learning capabilities at scale.

Abstract

Prompting techniques have significantly enhanced the capabilities of Large Language Models (LLMs) across various complex tasks, including reasoning, planning, and solving math word problems. However, most research has predominantly focused on language-based reasoning and word problems, often overlooking the potential of LLMs in handling symbol-based calculations and reasoning. This study aims to bridge this gap by rigorously evaluating LLMs on a series of symbolic tasks, such as addition, multiplication, modulus arithmetic, numerical precision, and symbolic counting. Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks. The assessment framework is anchored in Chomsky's Hierarchy, providing a robust measure of the computational abilities of these models. The evaluation employs minimally explained prompts alongside the zero-shot Chain of Thoughts technique, allowing models to navigate the solution process autonomously. The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases. Notably, even the fine-tuned GPT3.5 exhibits only marginal improvements, mirroring the performance trends observed in other models. Across the board, all models demonstrated a limited generalization ability on these symbol-intensive tasks. This research underscores LLMs' challenges with increasing symbolic complexity and highlights the need for specialized training, memory and architectural adjustments to enhance their proficiency in symbol-based reasoning tasks.
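The evaluation setup described above (symbolic tasks whose difficulty grows with the number of symbols, scored on the final answer) can be illustrated with a small harness. This is only a sketch under stated assumptions, not the authors' code: the names `make_addition_task`, `exact_match_accuracy`, and `perfect_solver` are hypothetical, the prompt wording is invented for illustration, and `perfect_solver` is a stand-in for whatever call would query a real model.

```python
import random


def make_addition_task(num_digits, rng):
    """Return (prompt, expected_answer) for adding two num_digits-digit integers."""
    low = 10 ** (num_digits - 1)   # smallest integer with num_digits digits
    high = 10 ** num_digits - 1    # largest integer with num_digits digits
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    return f"{a} + {b} =", str(a + b)


def exact_match_accuracy(answer_fn, num_digits, trials=50, seed=0):
    """Fraction of tasks where answer_fn returns exactly the expected string."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prompt, expected = make_addition_task(num_digits, rng)
        if answer_fn(prompt).strip() == expected:
            correct += 1
    return correct / trials


def perfect_solver(prompt):
    """Hypothetical stand-in for a model call; computes the sum directly."""
    left, right = prompt.rstrip("= ").split("+")
    return str(int(left) + int(right))


if __name__ == "__main__":
    # Sweep the input size, mirroring the idea of measuring accuracy
    # as the number of symbols grows.
    for digits in (2, 4, 8, 16):
        print(digits, exact_match_accuracy(perfect_solver, digits))
```

Replacing `perfect_solver` with a function that calls an actual model would give an accuracy-versus-length curve of the kind the abstract refers to; the sweep sizes here are arbitrary.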

Paper Structure

This paper contains 16 sections, 9 theorems, 15 equations, 15 figures, and 3 tables.

Key Result

Proposition 3.1

For the addition of two base $p$ numbers with finite digits $n$ and $m$, respectively, where $n \geq m$, the encoding of their sum requires at most $(2n + m + 1) \log_2 p$ bits. This account includes the possibility of a carryover in the addition, which may increase the length of the resulting numbe
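One way to read the digit count in this bound, offered as an assumption because the statement is cut off here: the two operands contribute $n$ and $m$ digits, the sum contributes at most $n + 1$ digits once a carry is allowed, and each base-$p$ digit costs $\log_2 p$ bits, giving $(n + m + n + 1) \log_2 p = (2n + m + 1) \log_2 p$. The sketch below checks the "at most $n + 1$ digits" part numerically; the function names are hypothetical and the check is an illustration, not the paper's proof.

```python
import math
import random


def digits_in_base(value, base):
    """Number of base-`base` digits needed to write a non-negative integer."""
    count = 1
    while value >= base:
        value //= base
        count += 1
    return count


def bound_bits(n, m, base):
    """The bound from the proposition: (2n + m + 1) * log2(base) bits."""
    return (2 * n + m + 1) * math.log2(base)


def check_sum_length(n, m, base, trials=1000, seed=0):
    """Check that an n-digit plus an m-digit number (n >= m) never needs
    more than n + 1 digits in the given base."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randrange(base ** (n - 1), base ** n)  # exactly n digits
        b = rng.randrange(base ** (m - 1), base ** m)  # exactly m digits
        if digits_in_base(a + b, base) > n + 1:
            return False
    return True


if __name__ == "__main__":
    print(check_sum_length(5, 3, 10))
    print(bound_bits(5, 3, 10))
```

The worst case is easy to see by hand: in base 10, $999 + 999 = 1998$ has four digits, one more than either operand.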

Figures (15)

  • Figure 1: Performance of LLMs on Sum of Sequence task
  • Figure 2: Performance of LLMs on Modulo 10 Arithmetic task
  • Figure 3: Performance of LLMs on Decimal Arithmetic task
  • Figure 4: Performance of LLMs on Multiplication task
  • Figure 5: Performance of LLMs on Symbolic Counter task
  • ...and 10 more figures

Theorems & Definitions (21)

  • Proposition 3.1
  • proof
  • Corollary 3.2
  • Proposition 3.3
  • proof
  • Proposition 3.4
  • proof
  • Corollary 3.5
  • proof
  • Proposition 3.6
  • ...and 11 more