A Comparison of Large Language Model and Human Performance on Random Number Generation Tasks
Rachel M. Harrison
TL;DR
This paper investigates whether a large language model (LLM) trained on human text reproduces human biases in random number generation, by adapting a standard random number generation task (RNGT) for an LLM and evaluating ChatGPT-3.5. It collects 10,000 sequences, each with a target length drawn from $N(269, 325^2)$, and analyzes metrics such as repeat frequency, adjacent increases and decreases, and digit frequencies, comparing the results to human data and to a uniformly random baseline. The results indicate that ChatGPT-3.5 avoids repeats and adjacent-number patterns more effectively than humans and lies closer to pseudorandom expectations for adjacent increases and decreases, though its output is still not ideally random. The work highlights how LLM training data and prompting shape RNG behavior, offers methodological insight for AI-assisted cognitive research, and outlines limitations and avenues for future work across more models and metrics.
Abstract
Random Number Generation Tasks (RNGTs) are used in psychology to examine how humans generate sequences devoid of predictable patterns. By adapting an existing human RNGT for an LLM-compatible environment, this preliminary study tests whether ChatGPT-3.5, a large language model (LLM) trained on human-generated text, exhibits human-like cognitive biases when generating random number sequences. Initial findings indicate that ChatGPT-3.5 avoids repetitive and sequential patterns more effectively than humans, with notably lower repeat frequencies and adjacent number frequencies. Continued research into different models, parameters, and prompting methodologies will deepen our understanding of how LLMs can more closely mimic human random generation behaviors, while also broadening their applications in cognitive and behavioral science research.
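To make the metrics named above concrete, the following is a minimal sketch of how repeat frequency, adjacent increases and decreases, and digit frequencies might be computed for a single sequence. The definitions are assumptions for illustration, not necessarily the exact ones used in this study: a repeat is a number equal to its predecessor, an adjacent increase or decrease is a step of exactly +1 or -1 with no wrap-around, and each rate is taken over consecutive pairs. The function names `rng_metrics` and `uniform_baseline`, and the 0 to 9 response range, are likewise hypothetical.

```python
from collections import Counter
import random


def rng_metrics(seq):
    """Compute simple randomness metrics for a sequence of integers.

    Assumed definitions (illustrative only):
      repeat            -- next number equals the previous one
      adjacent_increase -- next number is exactly previous + 1
      adjacent_decrease -- next number is exactly previous - 1
    Each rate is a fraction of the consecutive pairs in the sequence.
    """
    if len(seq) < 2:
        raise ValueError("need at least two numbers to form a pair")

    pairs = list(zip(seq, seq[1:]))
    n_pairs = len(pairs)
    counts = Counter(seq)

    return {
        "repeat": sum(b == a for a, b in pairs) / n_pairs,
        "adjacent_increase": sum(b == a + 1 for a, b in pairs) / n_pairs,
        "adjacent_decrease": sum(b == a - 1 for a, b in pairs) / n_pairs,
        # Relative frequency of each distinct number that appears.
        "digit_freq": {d: counts[d] / len(seq) for d in sorted(counts)},
    }


def uniform_baseline(length, seed=0, low=0, high=9):
    """Pseudorandom comparison sequence, uniform over low..high inclusive.

    The 0-9 range is an assumption, not taken from the study.
    """
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(length)]


if __name__ == "__main__":
    example = [1, 2, 2, 1, 3]
    print(rng_metrics(example))
    print(rng_metrics(uniform_baseline(10_000)))
```

Under these assumed definitions, a uniform source over ten values would be expected to produce repeats in about 10% of pairs, so a repeat rate well below that suggests repeat avoidance, the pattern reported here for both humans and ChatGPT-3.5.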
