Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun
TL;DR
The paper asks whether large language models (LLMs) can simulate human behavioral variability in a phonemic fluency task. The study evaluates 34 model configurations across six providers and compares their outputs with responses from 106 human participants. Some LLMs match average human performance, with Claude 3.7 Sonnet coming closest to human averages, but none reproduce human variability or the retrieval structures evident in human memory search. Item-level analyses reveal similar but attenuated associations with lexical variables and Zipfian distributions, and network analyses identify fundamental differences in how humans and LLMs organize word retrieval. The results highlight critical limitations in using LLMs as proxies for human cognition in behavioral research. They also suggest that current approaches, including prompt variation, temperature adjustments, and ensembles, are insufficient to capture human-like variability; future work should pursue persona-based or cognitive-profile modeling to better emulate individual differences.
Abstract
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and more structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.
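To make the distinction between matching averages and matching variability concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each response is a list of words for one target letter, scores a response as its unique words beginning with that letter, and summarizes a group by the mean and standard deviation of those scores plus the mean pairwise Jaccard overlap between responses. The function names (`score_response`, `jaccard`, `summarize`), the scoring rule, and the toy word lists are all hypothetical and are not data or measures from the study.

```python
from itertools import combinations
from statistics import mean, stdev


def score_response(words, letter):
    """Return the unique words that start with `letter`, in original order.

    Assumed scoring rule: case-insensitive, repeats counted once.
    """
    seen = set()
    valid = []
    for w in words:
        w = w.strip().lower()
        if w and w.startswith(letter.lower()) and w not in seen:
            seen.add(w)
            valid.append(w)
    return valid


def jaccard(a, b):
    """Jaccard overlap between two word lists (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(responses, letter):
    """Summarize a group of responses: average score, spread, and overlap."""
    scored = [score_response(r, letter) for r in responses]
    counts = [len(s) for s in scored]
    overlaps = [jaccard(a, b) for a, b in combinations(scored, 2)]
    return {
        "mean_count": mean(counts),
        # Sample standard deviation; needs at least two responses.
        "sd_count": stdev(counts) if len(counts) > 1 else 0.0,
        "mean_overlap": mean(overlaps) if overlaps else 0.0,
    }


# Toy, made-up word lists for illustration only (not from the study).
humans_toy = [
    ["fish", "fun", "farm"],
    ["fast", "fish", "frog", "fog", "fan"],
    ["fox"],
]
llm_toy = [
    ["fish", "fun", "farm", "fast"],
    ["fish", "fun", "farm", "fast"],
    ["fish", "fun", "fast", "fox"],
]

if __name__ == "__main__":
    print("humans:", summarize(humans_toy, "f"))
    print("llm:   ", summarize(llm_toy, "f"))
```

In this toy setup the two groups have similar mean counts, while the second group shows no spread in counts and much higher overlap between responses. That is the kind of pattern the abstract describes as matching averages without reproducing variability.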
