How Random is Random? Evaluating the Randomness and Humaness of LLMs' Coin Flips
Katherine Van Koevering, Jon Kleinberg
TL;DR
The paper investigates whether large language models produce true randomness or reproduce human-like biases when generating binary sequences. It runs a coin-flip study across GPT-3.5, GPT-4, and Llama-3, using prompts such as "Flip 20 coins" and metrics including first-flip bias, fraction of heads, runs, n-grams, and Mean Squared Error (MSE), and examines how these vary across temperatures $0 \le T \le 1.5$. The findings show that GPT-4 and Llama-3 amplify several human randomness biases, while GPT-3.5 remains comparatively more random, especially at higher temperatures. The work highlights a fundamental trade-off between human-like biases and machine-like randomness in LLMs and calls for standardized randomness benchmarks to guide model evaluation and deployment.
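To make the listed metrics concrete, here is a minimal Python sketch of how first-flip bias, fraction of heads, runs, and n-gram counts could be computed for a single sequence of flips. The function names (`flip_metrics`, `expected_runs`) and the `"H"`/`"T"` string encoding are assumptions made for illustration; the definitions are the standard textbook ones and may differ in detail from those used in the paper.

```python
from collections import Counter
from itertools import groupby


def flip_metrics(seq, n=2):
    """Summarize one sequence of flips given as a string of 'H' and 'T'.

    Hypothetical helper for illustration; not the paper's code.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    # A run is a maximal block of identical consecutive flips.
    runs = [len(list(group)) for _, group in groupby(seq)]
    # Count every overlapping window of length n.
    ngrams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return {
        "first_flip": seq[0],
        "frac_heads": seq.count("H") / len(seq),
        "num_runs": len(runs),
        "longest_run": max(runs),
        "ngrams": ngrams,
    }


def expected_runs(length, p=0.5):
    """Expected number of runs for independent flips with P(heads) = p.

    Each of the (length - 1) adjacent pairs differs with probability
    2 * p * (1 - p), and the number of runs is one plus the number of switches.
    """
    return 1 + 2 * (length - 1) * p * (1 - p)


metrics = flip_metrics("HHTHTTTH")
print(metrics["frac_heads"], metrics["num_runs"], metrics["longest_run"])
print(expected_runs(20))
```

Comparing `num_runs` against `expected_runs` gives one simple signal of human-like bias: people tend to alternate too often, which produces more and shorter runs than a fair coin would. Aggregating `first_flip` over many generated sequences would estimate first-flip bias in the same spirit.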
Abstract
One uniquely human trait is our inability to be random. We see and produce patterns where there should not be any, and we do so in a predictable way. LLMs are supplied with human data and are prone to human biases. In this work, we explore how LLMs approach randomness, and where and how they fail, through the lens of the well-studied phenomenon of generating binary random sequences. We find that GPT 4 and Llama 3 exhibit and exacerbate nearly every human bias we test in this context, but GPT 3.5 exhibits more random behavior. We propose this dichotomy between randomness and humanness as a fundamental question for LLMs, and suggest that either behavior may be useful in different circumstances.
