Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models

Flavio Petruzzellis, Alberto Testolin, Alessandro Sperduti

TL;DR

This study systematically probes emergent symbolic reasoning in open-source Llama 2 family models by evaluating Llama 2 Chat and two fine-tuned variants (MetaMath, MAmmoTH) on synthetic symbolic benchmarks with controllable difficulty, specifically ListOps and arithmetic formulas with nesting up to four levels. Results show that increasing model size and domain-focused fine-tuning improve performance, but gains are mainly on low-complexity formulas, with substantial limitations when handling deeper compositional structure and tricky operations like modulo with negative operands. The work highlights that current open-source LLMs exhibit limited emergent symbolic reasoning, even at 70B parameters, and underscores the need for architectures better suited to symbolic computation. The proposed evaluation framework offers a precise, scalable way to quantify symbolic reasoning capabilities across model scales and training regimes, aiding future research and benchmarking efforts.
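One plausible reason modulo with negative operands is error-prone is that two conventions for the sign of the result are in common use. The sketch below contrasts them in Python. The helper names `floored_mod` and `truncated_mod` are hypothetical, and which convention the benchmark adopts is not stated in this summary.

```python
import math


def floored_mod(a: int, n: int) -> int:
    """Remainder with the sign of the divisor (Python's built-in % operator)."""
    return a % n


def truncated_mod(a: int, n: int) -> int:
    """Remainder with the sign of the dividend (the convention of C's % operator)."""
    return int(math.fmod(a, n))


# The two conventions agree when both operands are non-negative...
print(floored_mod(7, 3), truncated_mod(7, 3))
# ...and disagree as soon as exactly one operand is negative.
print(floored_mod(-7, 3), truncated_mod(-7, 3))
print(floored_mod(7, -3), truncated_mod(7, -3))
```

A model that has seen both conventions in its training data has no way to tell from the formula alone which one is expected, so an evaluation needs to fix the convention explicitly.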

Abstract

Large Language Models (LLMs) achieve impressive performance in a wide range of tasks, even though they are often trained with the sole objective of chatting fluently with users. Among other skills, LLMs show emergent abilities on mathematical reasoning benchmarks, which can be elicited with appropriate prompting methods. In this work, we systematically investigate the capabilities and limitations of popular open-source LLMs on different symbolic reasoning tasks. We evaluate three models of the Llama 2 family on two datasets that require solving mathematical formulas of varying degrees of difficulty. We test a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2 (MAmmoTH and MetaMath) specifically designed to tackle mathematical problems. We observe that both increasing the scale of the model and fine-tuning it on relevant tasks lead to significant performance gains. Furthermore, using fine-grained evaluation measures, we find that such performance gains are mostly observed with mathematical formulas of low complexity, which nevertheless often remain challenging even for the largest fine-tuned models.
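To make "formulas of varying degrees of difficulty" concrete, here is a minimal sketch of how arithmetic formulas with a controllable nesting level could be generated together with their ground-truth values. It is an illustration only, not the authors' generation procedure: the function names, the operand range, and the left-nested shape of the formulas are all assumptions, and the operator set (+, -, *) is taken from the Figure 4 caption.

```python
import random

# Operators named in the Figure 4 caption; the mapping itself is illustrative.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_formula(nesting: int, rng: random.Random, lo: int = -9, hi: int = 9):
    """Return (text, value) for a random formula with the given nesting level.

    Nesting level 0 is a bare integer; each extra level wraps the previous
    formula in one more pair of parentheses. The operand range [lo, hi] is a
    hypothetical choice.
    """
    if nesting == 0:
        n = rng.randint(lo, hi)
        return str(n), n
    op = rng.choice(sorted(OPS))
    inner_text, inner_value = make_formula(nesting - 1, rng, lo, hi)
    operand = rng.randint(lo, hi)
    return f"({inner_text} {op} {operand})", OPS[op](inner_value, operand)


def nesting_level(text: str) -> int:
    """Maximum parenthesis depth of a formula string."""
    depth = deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


rng = random.Random(0)
for k in range(1, 5):
    text, value = make_formula(k, rng)
    print(f"N{k}: {text} = {value}")
```

Because the generator returns the exact value alongside the text, a model's answer can be scored by exact match and the scores grouped by nesting level, which is the kind of fine-grained breakdown the abstract refers to.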

Paper Structure (11 sections, 4 figures)

Figures (4)

  • Figure 1: Average accuracy on the ListOps and Arithmetic tasks obtained by Llama 2, MAmmoTH, and MetaMath models of increasing size. Larger models (especially if fine-tuned on math tasks) achieve better performance than smaller ones.
  • Figure 2: Accuracy of Llama 2, MAmmoTH and MetaMath models on ListOps (top) and arithmetic (bottom) formulas of varying levels of difficulty as a function of model size. N$k$ indicates formulas with nesting level $k$.
  • Figure 3: Type of errors made by the models on ListOps formulas with a single nesting level. The absolute number of errors is on the $y$-axis and the operator used in the formula is on the $x$-axis.
  • Figure 4: Type of errors made by the models on arithmetic formulas with nesting level 1, grouped by operator (+, -, $\ast$) and sign of the result ($y$). Incidence of errors is measured by group.
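For readers unfamiliar with ListOps, the following sketch evaluates a formula in the bracketed prefix notation of the original ListOps dataset. The operator set (MIN, MAX, MED, and SM for sum modulo 10) and the rounding of the median for even-length lists are assumptions based on that dataset rather than on anything stated on this page, and `eval_listops` is a hypothetical helper name.

```python
import statistics

# Assumed operator set, following the original ListOps dataset.
LISTOPS = {
    "MIN": min,
    "MAX": max,
    "MED": lambda xs: int(statistics.median(xs)),  # truncates x.5 medians
    "SM": lambda xs: sum(xs) % 10,                 # sum modulo 10
}


def eval_listops(expr: str) -> int:
    """Evaluate a ListOps expression such as "[SM 3 4 [MAX 1 9]]"."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse() -> int:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "[":
            return int(tok)  # a plain digit argument
        op = tokens[pos]
        pos += 1
        args = []
        while tokens[pos] != "]":
            args.append(parse())  # arguments may themselves be nested lists
        pos += 1  # consume the closing bracket
        return LISTOPS[op](args)

    value = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


print(eval_listops("[SM 3 4 [MAX 1 9]]"))
print(eval_listops("[MIN 5 [MED 1 7 3] 8]"))
```

In this notation the nesting level of a formula is its bracket depth, so a single-nesting formula like those analysed in Figure 3 applies exactly one operator to a flat list of digits.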