How Random is Random? Evaluating the Randomness and Humaness of LLMs' Coin Flips

Katherine Van Koevering, Jon Kleinberg

TL;DR

The paper investigates whether large language models produce true randomness or human-like biases when generating binary sequences. It conducts a coin-flip study across GPT-3.5, GPT-4, and Llama-3, using prompts such as "Flip 20 coins" and metrics including first-flip bias, fraction of heads, runs, n-grams, and Mean Squared Error (MSE), and tracks how these vary across temperatures $0 \le T \le 1.5$. The findings show that GPT-4 and Llama-3 amplify several human randomness biases, while GPT-3.5 remains comparatively more random, especially at higher temperatures. The work highlights a fundamental trade-off between human-like biases and machine-like randomness in LLMs and calls for standardized randomness benchmarks to guide model evaluation and deployment.

Abstract

One uniquely human trait is our inability to be random. We see and produce patterns where there should not be any, and we do so in a predictable way. LLMs are supplied with human data and are prone to human biases. In this work, we explore how LLMs approach randomness, and where and how they fail, through the lens of the well-studied phenomenon of generating binary random sequences. We find that GPT 4 and Llama 3 exhibit and exacerbate nearly every human bias we test in this context, but GPT 3.5 exhibits more random behavior. We propose this dichotomy of randomness or humaness as a fundamental question of LLMs and note that either behavior may be useful in different circumstances.

Paper Structure (17 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Percentage of runs with $x$ heads for different models and prompts. The KLM data is taken from Kleinberg et al. [kleinberg2017theory], who performed similar experiments on sequences of 8 coin flips taken from longer samples. Bernoulli represents a 'truly random' sequence of Bernoulli trials. Note that KLM is most similar to the Bernoulli distribution, but that all three models, at higher temperatures, exhibit and exacerbate the human pattern of too many sequences with nearly half heads and half tails and too few at the extremes.
  • Figure 2: Histograms of the number of alternations in an 8-flip sequence for our various models and a theoretical repeated Bernoulli process, at temperatures 0, 0.8, and 1.5. Note the strong bias towards alternations, especially at lower temperatures, exhibited by all three models.
  • Figure 3: Fraction of expected runs realized by various models and prompts for runs of length 2 through 7 in 7-flip sequences. Any point below 1.0 represents fewer runs of that length than expected, and any point above 1.0 represents more runs of that length than expected. The sample size is large enough in all cases to expect at least one run of each length, but very few runs of the greatest length are expected (or realized).
  • Figure 4: The fraction of each 2-gram type among all 2-grams, for various temperatures and models. At each temperature, three distinct bars show GPT 3.5, GPT 4, and Llama 3, respectively. The final two bars, 'H', show data from Rapoport et al. [Rapoport1992], who did a similar analysis of human-generated coin flips, on the left and the fractions expected at random on the right. The full table of n-gram percentages can be found in the appendix.
  • Figure 5: The fraction of each 3-gram type among all 3-grams, for various temperatures and models. At each temperature, three distinct bars show GPT 3.5, GPT 4, and Llama 3, respectively. The final two bars, 'H', show data from Rapoport et al. [Rapoport1992], who did a similar analysis of human-generated coin flips, on the left and the fractions expected at random on the right. The full table of n-gram percentages can be found in the appendix.
  • ...and 3 more figures
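
The quantities named in the captions above (heads per sequence, alternations, runs, n-grams) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `"H"`/`"T"` string encoding, the use of maximal runs, and the use of overlapping n-grams are all assumptions made for illustration, and the paper's exact definitions may differ. The two baseline functions give the fair-coin distributions that a 'Bernoulli' reference would follow under those assumptions.

```python
from collections import Counter
from itertools import groupby
from math import comb


def fraction_heads(seq: str) -> float:
    """Fraction of 'H' symbols in a sequence such as 'HTHHTTHT'."""
    return seq.count("H") / len(seq)


def alternations(seq: str) -> int:
    """Number of adjacent positions where the outcome switches."""
    return sum(a != b for a, b in zip(seq, seq[1:]))


def run_lengths(seq: str) -> list[int]:
    """Lengths of maximal runs of identical outcomes, in order.

    Assumption: a 'run' is a maximal streak; the paper may count runs differently.
    """
    return [len(list(group)) for _, group in groupby(seq)]


def ngram_fractions(seq: str, n: int) -> dict[str, float]:
    """Share of each overlapping n-gram among all n-grams in the sequence."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def bernoulli_heads_pmf(length: int) -> list[float]:
    """P(k heads) in `length` fair flips, for k = 0..length (binomial)."""
    return [comb(length, k) / 2 ** length for k in range(length + 1)]


def bernoulli_alternations_pmf(length: int) -> list[float]:
    """P(k alternations) in `length` fair flips.

    For a fair coin, each of the length-1 adjacent pairs switches
    independently with probability 1/2, so the count is binomial.
    """
    gaps = length - 1
    return [comb(gaps, k) / 2 ** gaps for k in range(gaps + 1)]


example = "HTHHTTHT"  # hypothetical 8-flip sequence
print(fraction_heads(example))
print(alternations(example))
print(run_lengths(example))
print(ngram_fractions(example, 2))
print(bernoulli_heads_pmf(8))
print(bernoulli_alternations_pmf(8))
```

Comparing a model's empirical histogram against `bernoulli_heads_pmf(8)` or `bernoulli_alternations_pmf(8)` is one way to see the over-representation of balanced, highly alternating sequences that the captions describe.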