Working Memory Capacity of ChatGPT: An Empirical Study
Dongyu Gong, Xingchen Wan, Dingmin Wang
TL;DR
This paper empirically assesses the working memory capacity of ChatGPT using verbal and spatial $n$-back tasks across multiple conditions. By measuring $d'$ across $n \in \{1,2,3\}$ and employing variants with noise, feedback, and reasoning prompts, the authors reveal a human-like capacity limit near $n=3$, with performance deteriorating under distraction and cognitive load. The study further compares several LLMs, finding GPT-4 to exhibit the strongest WM capacity, while open-source models show markedly lower capacity. The results suggest $n$-back tasks as effective benchmarks for WM in LLMs and highlight potential architecture-driven constraints, motivating future work to enhance AI working memory and its role in general intelligence.
Abstract
Working memory is a critical aspect of both human intelligence and artificial intelligence, serving as a workspace for the temporary storage and manipulation of information. In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. Furthermore, we investigate the impact of different instruction strategies on ChatGPT's performance and observe that the fundamental patterns of a capacity limit persist. From our empirical findings, we propose that n-back tasks may serve as tools for benchmarking the working memory capacity of large language models and hold potential for informing future efforts aimed at enhancing AI working memory.
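To make the scoring concrete, the following is a minimal sketch of how a single $n$-back block could be scored with the sensitivity index $d' = Z(\text{hit rate}) - Z(\text{false alarm rate})$. The function names `nback_targets` and `d_prime` are hypothetical, and the log-linear correction (adding 0.5 to each count and 1 to each total) is an assumption made here to keep rates away from 0 and 1; the authors' exact procedure may differ.

```python
from statistics import NormalDist


def nback_targets(stimuli, n):
    """Return True at each position whose item equals the item n steps back."""
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]


def d_prime(targets, responses):
    """Compute d' from ground-truth targets and the model's match responses.

    Assumption: a log-linear correction is applied so that hit and
    false-alarm rates never reach exactly 0 or 1.
    """
    hits = sum(t and r for t, r in zip(targets, responses))
    false_alarms = sum((not t) and r for t, r in zip(targets, responses))
    n_targets = sum(targets)
    n_nontargets = len(targets) - n_targets

    hit_rate = (hits + 0.5) / (n_targets + 1)
    fa_rate = (false_alarms + 0.5) / (n_nontargets + 1)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Illustrative 2-back block over a short letter sequence.
stimuli = list("abacab")
targets = nback_targets(stimuli, n=2)

perfect = d_prime(targets, targets)                       # every response correct
inverted = d_prime(targets, [not t for t in targets])     # every response wrong
```

A higher $d'$ indicates better discrimination between matches and non-matches; a responder who gets every trial wrong receives a negative value of the same magnitude as a perfect responder.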
