Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models
Flavio Petruzzellis, Alberto Testolin, Alessandro Sperduti
TL;DR
This study systematically probes emergent symbolic reasoning in open-source Llama 2 family models by evaluating Llama 2 Chat and two fine-tuned variants (MetaMath, MAmmoTH) on synthetic symbolic benchmarks with controllable difficulty, specifically ListOps and arithmetic formulas with nesting up to four levels. Results show that increasing model size and domain-focused fine-tuning improve performance, but gains are mainly on low-complexity formulas, with substantial limitations when handling deeper compositional structure and tricky operations like modulo with negative operands. The work highlights that current open-source LLMs exhibit limited emergent symbolic reasoning, even at 70B parameters, and underscores the need for architectures better suited to symbolic computation. The proposed evaluation framework offers a precise, scalable way to quantify symbolic reasoning capabilities across model scales and training regimes, aiding future research and benchmarking efforts.
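To illustrate why modulo with negative operands can be a stumbling block, the following minimal sketch contrasts two common conventions for the sign of the remainder. It is only an illustration: the text above does not state which convention the benchmark adopts, and the helper names `floored_mod` and `truncated_mod` are hypothetical.

```python
def floored_mod(a: int, b: int) -> int:
    """Remainder takes the sign of the divisor (Python's built-in % behaviour)."""
    return a % b


def truncated_mod(a: int, b: int) -> int:
    """Remainder takes the sign of the dividend (the convention used by C-style %)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


# The two conventions agree on non-negative operands...
print(floored_mod(7, 3), truncated_mod(7, 3))    # 1 1
# ...but diverge as soon as one operand is negative.
print(floored_mod(-7, 3), truncated_mod(-7, 3))  # 2 -1
print(floored_mod(7, -3), truncated_mod(7, -3))  # -2 1
```

A model that has seen both conventions in its training data has no way to tell from the formula alone which one is intended, which is one plausible reason such cases are error-prone.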
Abstract
Large Language Models (LLMs) achieve impressive performance on a wide range of tasks, even though they are often trained with the sole objective of chatting fluently with users. Among other skills, LLMs show emergent abilities on mathematical reasoning benchmarks, which can be elicited with appropriate prompting methods. In this work, we systematically investigate the capabilities and limitations of popular open-source LLMs on different symbolic reasoning tasks. We evaluate three models of the Llama 2 family on two datasets that require solving mathematical formulas of varying degrees of difficulty. We test a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2 (MAmmoTH and MetaMath) specifically designed to tackle mathematical problems. We observe that both increasing the scale of the model and fine-tuning it on relevant tasks lead to significant performance gains. Furthermore, using fine-grained evaluation measures, we find that these gains are concentrated on mathematical formulas of low complexity, which nevertheless often remain challenging even for the largest fine-tuned models.
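The notion of formulas with controllable difficulty can be made concrete with a small sketch of a ListOps-style generator and evaluator in which the nesting depth is an explicit parameter. This is not the authors' code: the operator set (`MAX`, `MIN`, `SM` for sum modulo 10), the fixed number of arguments, and all function names are assumptions chosen for illustration.

```python
import random

# Assumed operator set: each operator maps a list of single digits to one digit.
OPS = {
    "MAX": max,
    "MIN": min,
    "SM": lambda xs: sum(xs) % 10,  # sum modulo 10
}


def generate(depth, rng, n_args=3):
    """Build a random expression whose nesting depth is exactly `depth`."""
    op = rng.choice(sorted(OPS))
    args = [rng.randint(0, 9) for _ in range(n_args)]
    if depth > 1:
        # Replace one argument with a sub-expression one level shallower.
        args[rng.randrange(n_args)] = generate(depth - 1, rng, n_args)
    return (op, args)


def evaluate(expr):
    """Recursively compute the value of an expression."""
    if isinstance(expr, int):
        return expr
    op, args = expr
    return OPS[op]([evaluate(a) for a in args])


def nesting(expr):
    """Number of nested operator levels (0 for a bare digit)."""
    if isinstance(expr, int):
        return 0
    return 1 + max(nesting(a) for a in expr[1])


def render(expr):
    """Format an expression in bracketed prefix notation."""
    if isinstance(expr, int):
        return str(expr)
    op, args = expr
    return "[" + op + " " + " ".join(render(a) for a in args) + "]"


example = ("SM", [3, ("MAX", [1, 7, 2]), 5])
print(render(example), "=", evaluate(example))  # [SM 3 [MAX 1 7 2] 5] = 5

# Sample one formula per difficulty level, up to four levels of nesting.
rng = random.Random(0)
for depth in range(1, 5):
    expr = generate(depth, rng)
    print(depth, render(expr), "=", evaluate(expr))
```

Because the depth is fixed by construction, accuracy can be reported separately for each nesting level, which is the kind of fine-grained measurement the abstract refers to.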
