Sequential Enumeration in Large Language Models

Kuinan Hou, Marco Zorzi, Alberto Testolin

TL;DR

The paper investigates whether state-of-the-art LLMs can perform exact sequential enumeration, using counting (naming) and production tasks across letters and five-letter words with homogeneous and heterogeneous stimuli. It examines multiple prompting strategies (explicit, spontaneous, mental, forbid) and analyzes internal representations via PCA on last-layer embeddings in a large model (Llama-70B), linking behavior to latent dynamics. Key findings show counting is reliable mainly when explicitly prompted, while spontaneous counting is rare; mental counting reveals counter-like internal dynamics, whereas explicit counting relies on surface token strategies. The results reveal a persistent gap between neural and symbolic approaches to numeracy in LLMs and highlight the need for grounding numerosity or developing architectures that support robust counting.

Abstract

Reliably counting and generating sequences of items remains a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, systematically deploying counting procedures is difficult for neural models, which must acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emergence of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether the mastery of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engages in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.

Paper Structure

This paper contains 36 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Comparison of accuracy and mean absolute error (MAE) across different models and prompting conditions. Bars represent accuracy or MAE for each model (x-axis) across the four tasks: naming/production × letter/word. Error bars indicate the variability of the estimates: for accuracy, binomial standard errors are shown; for MAE, standard errors of the mean are shown.
  • Figure 2: Scatter plots of enumeration errors (response – target value) across different prompting conditions for a selected subset of models. Each panel shows the response differences plotted against the target values, with distinct markers of different colors representing different task-stimulus combinations: blue for the naming task and green for the production task; triangles for letters and filled circles for words.
  • Figure 3: Trajectories of the first two PCs computed over hidden states, separately for the explicit and mental counting conditions. Each subplot shows the trajectories along PC1 and PC2 as a function of generation step, with colors indicating target numerosity.
  • Figure 4: 2D presentation of the PCA results. The marker’s color gradient reflects generation order across generation steps.
  • Figure 5: Population dynamics of unit activations across stepwise counting. For each condition, we plot the average activation of selected neurons within steps from 1 to 100. The activation is normalized and shaded with standard error for better visualization.
  • ...and 5 more figures
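
The quantities named in the captions of Figures 1 and 2 (accuracy with a binomial standard error, MAE with a standard error of the mean, and the signed error response – target) can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the paper. The sketch assumes that accuracy means an exact match between response and target, and that the standard error of the mean uses the sample standard deviation with Bessel's correction; the paper's exact conventions may differ.

```python
import math


def accuracy_with_se(responses, targets):
    """Exact-match accuracy p and its binomial standard error sqrt(p * (1 - p) / n)."""
    n = len(targets)
    correct = sum(r == t for r, t in zip(responses, targets))
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)


def mae_with_sem(responses, targets):
    """Mean absolute error and the standard error of the mean of |response - target|."""
    abs_errors = [abs(r - t) for r, t in zip(responses, targets)]
    n = len(abs_errors)
    mean = sum(abs_errors) / n
    # Sample variance with Bessel's correction (an assumption); requires n >= 2.
    variance = sum((e - mean) ** 2 for e in abs_errors) / (n - 1)
    return mean, math.sqrt(variance / n)


def signed_errors(responses, targets):
    """Signed enumeration error (response - target), the kind of value a scatter plot would show."""
    return [r - t for r, t in zip(responses, targets)]


# Toy data, invented for illustration only.
targets = [10, 20, 30, 40]
responses = [10, 19, 30, 44]

acc, acc_se = accuracy_with_se(responses, targets)
mae, mae_sem = mae_with_sem(responses, targets)
diffs = signed_errors(responses, targets)
```

Reporting both metrics is informative because they capture different things: accuracy only records whether the count was exactly right, whereas MAE and the signed error show how far off, and in which direction, an incorrect response was.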