Working Memory Capacity of ChatGPT: An Empirical Study

Dongyu Gong; Xingchen Wan; Dingmin Wang

TL;DR

This paper empirically assesses the working memory capacity of ChatGPT using verbal and spatial $n$-back tasks across multiple conditions. By measuring $d'$ across $n \in \{1, 2, 3\}$ and employing variants with noise, feedback, and reasoning prompts, the authors reveal a human-like capacity limit near $n=3$, with performance deteriorating under distraction and cognitive load. The study further compares several LLMs, finding GPT-4 to exhibit the strongest WM capacity, while open-source models show markedly lower capacity. The results suggest $n$-back tasks as effective benchmarks for WM in LLMs and highlight potential architecture-driven constraints, motivating future work to enhance AI working memory and its role in general intelligence.
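For readers unfamiliar with $d'$, the sketch below shows the standard signal-detection definition, $d' = Z(\text{hit rate}) - Z(\text{false-alarm rate})$, applied to counts from a block of $n$-back trials. The function name `d_prime` and the log-linear correction for extreme rates are assumptions made for illustration; the summary above does not say how the authors handled hit or false-alarm rates of exactly 0 or 1.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate).

    hits / misses count match trials answered "m" / "-";
    false_alarms / correct_rejections count nonmatch trials answered "m" / "-".
    """
    # Assumption (not taken from the paper): log-linear correction.
    # Adding 0.5 to each count keeps both rates strictly inside (0, 1),
    # so the inverse normal CDF never returns +/- infinity.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical counts for one block with 8 match and 16 nonmatch trials.
    print(d_prime(hits=6, misses=2, false_alarms=2, correct_rejections=14))
```

A higher $d'$ means match trials are better separated from nonmatch trials; a value near zero means responses carry no information about whether the current stimulus matches the one $n$ trials back.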

Abstract

Working memory is a critical aspect of both human intelligence and artificial intelligence, serving as a workspace for the temporary storage and manipulation of information. In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. Furthermore, we investigate the impact of different instruction strategies on ChatGPT's performance and observe that the fundamental patterns of a capacity limit persist. From our empirical findings, we propose that n-back tasks may serve as tools for benchmarking the working memory capacity of large language models and hold potential for informing future efforts aimed at enhancing AI working memory.

Paper Structure (12 sections, 29 figures, 12 tables)

Introduction
Related Work
Methods
  Verbal n-back experiments.
  Spatial n-back experiments.
Results
  Verbal n-back experiments.
  Spatial n-back experiments.
  Model comparison.
Discussion
Statistical Tests
Performance Distributions

Figures (29)

Figure 1: Illustrations of verbal (top row) and spatial (bottom row) n-back tasks with $n = \{1, 2, 3\}$. Participants are instructed to give a response ("m") when the current stimulus (e.g., a letter or a spatial location) is the same as the stimulus n trial(s) ago, and not respond ("-") on nonmatch trials.
Figure 2: Typical human performance in n-back tasks for $n=\{1, 2, 3\}$. We plot the mean $\pm 1$ standard deviation of the data collected in Jaeggi et al. (2010).
Figure 3: Illustrations of the variants of verbal n-back tasks (also applicable to spatial tasks). We use $n=2$ in the figure. (a): base version identical to the case presented in Figure 1 (top row); (b): stimulus on each trial now contains 3-6 random noise characters (chosen from "#$%&@") in addition to a single alphabetical letter that the LLM should compare across trials. The LLM is instructed to ignore these noise characters, and the alphabetical letter may appear in any position in the noise-corrupted stimulus; (c): alongside the input for every trial, the LLM is also provided with feedback on whether it has performed the previous trial correctly; (d): the LLM is prompted with a reasoning-eliciting instruction to output the final answer ("m" or "-") and the rationale. Refer to the table of verbal task prompts for the detailed prompts.
Figure 4: Illustrations of the variants of spatial n-back tasks ($n=2$ in the figure) besides those presented in Figure 3. (a): base version identical to the case presented in Figure 1 (bottom row); (b): spatial tasks with larger grid sizes ($4 \times 4$ shown for illustration; we considered $4 \times 4$, $5 \times 5$, and $7 \times 7$); (c) and (d): two types of spatial reasoning tasks that additionally require abstract reasoning. In (c), a match is expected whenever the letter X occurs in the same row and/or column as the location n trials ago (including identical locations); in (d), a match is expected when the letter X appears in the same row or column (but not both) as the location n trials ago (excluding identical locations). Refer to the table of spatial task prompts for detailed prompts.
Figure 5: Results of different variants of verbal n-back experiments. Error bars represent $\pm 1$ SEM.
...and 24 more figures
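
The task rules described in the captions of Figures 1, 3, and 4 can be sketched in a few lines. The code below is an illustrative reading of those captions only, not the authors' implementation: the function names, the rule labels, and the use of `(row, col)` tuples for grid locations are all hypothetical.

```python
import random

NOISE_CHARS = "#$%&@"  # noise alphabet quoted in the Figure 3 caption


def expected_responses(stimuli, n):
    """Base task (Figure 1): 'm' when the stimulus equals the one n trials ago, else '-'."""
    return [
        "m" if i >= n and stimuli[i] == stimuli[i - n] else "-"
        for i in range(len(stimuli))
    ]


def add_noise(letter, rng):
    """Noise variant (Figure 3b): one letter placed anywhere among 3-6 noise characters."""
    noise = [rng.choice(NOISE_CHARS) for _ in range(rng.randint(3, 6))]
    noise.insert(rng.randrange(len(noise) + 1), letter)
    return "".join(noise)


def spatial_match(current, previous, rule):
    """Spatial variants (Figure 4); `previous` is the (row, col) location n trials ago."""
    same_row = current[0] == previous[0]
    same_col = current[1] == previous[1]
    if rule == "base":            # identical location
        return same_row and same_col
    if rule == "row_and_or_col":  # Figure 4(c): same row and/or column
        return same_row or same_col
    if rule == "row_xor_col":     # Figure 4(d): same row or column, but not both
        return same_row != same_col
    raise ValueError(f"unknown rule: {rule}")


if __name__ == "__main__":
    print(expected_responses(list("abab"), n=2))  # ['-', '-', 'm', 'm']
    print(add_noise("k", random.Random(0)))
    print(spatial_match((0, 1), (0, 2), "row_xor_col"))  # True
```

The first `n` trials can never be matches because there is no stimulus `n` trials back, which is why `expected_responses` always starts with `n` nonmatch entries.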
