Deterministic or probabilistic? The psychology of LLMs as random number generators

Javier Coronado-Blázquez

TL;DR

The paper investigates whether LLMs can generate true random numbers or merely reproduce biases from training data and linguistic cues. It uses a large-scale multilingual, multi-model setup spanning three number ranges, six temperatures, and seven languages, with 75,600 prompts to quantify randomness via a proposed randomness index $RI$ and statistical tests ($\chi^2$, $\phi_C$), benchmarked against Python's random.randint(). Key findings show pervasive nonuniformity and language/model-specific biases, with central-value tendencies in low and medium ranges, and strong constraints in high ranges for several API-based models; DeepSeek-R1's chain-of-thought logs illustrate internal reasoning patterns that do not yield true randomness. The results have practical implications for any application requiring unpredictable randomness and motivate future work on bias mitigation, prompt design, and broader multilingual evaluation.
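As a rough illustration of the kind of uniformity check mentioned above, the following minimal sketch computes a $\chi^2$ goodness-of-fit statistic against a uniform distribution and a Cramér's-V-style effect size $\phi_C$, using Python's random.randint() as the baseline. The function name `chi_square_uniform`, the sample size of 100, the seed, and the hand-made "biased" sample are assumptions made for illustration only; the paper's exact test configuration and its randomness index $RI$ are not reproduced here.

```python
import math
import random
from collections import Counter


def chi_square_uniform(samples, low, high):
    """Return (chi2, phi_c) for integer samples against a uniform law on [low, high].

    Assumes every sample lies inside [low, high]. phi_c is the goodness-of-fit
    form of Cramér's V, sqrt(chi2 / (n * (k - 1))), which ranges from 0
    (perfectly uniform counts) to 1 (all samples on a single value).
    """
    k = high - low + 1          # number of possible values
    n = len(samples)            # number of draws
    expected = n / k            # expected count per value under uniformity
    counts = Counter(samples)
    chi2 = sum(
        (counts.get(v, 0) - expected) ** 2 / expected
        for v in range(low, high + 1)
    )
    phi_c = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, phi_c


# Baseline: pseudo-random draws in the 1-5 range (seed chosen arbitrarily).
rng = random.Random(0)
baseline = [rng.randint(1, 5) for _ in range(100)]

# Hypothetical LLM-like sample with a strong central-value preference.
biased = [3] * 70 + [4] * 20 + [2] * 10

print("baseline:", chi_square_uniform(baseline, 1, 5))
print("biased:  ", chi_square_uniform(biased, 1, 5))
```

A larger $\chi^2$ and a $\phi_C$ closer to 1 indicate a stronger departure from uniformity; the hand-made biased sample scores far higher on both than the random.randint() baseline.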

Abstract

Large Language Models (LLMs) have transformed text generation through inherently probabilistic, context-aware mechanisms that mimic human natural language. In this paper, we systematically investigate the performance of various LLMs when generating random numbers, considering diverse configurations such as different model architectures, numerical ranges, temperatures, and prompt languages. Our results reveal that, despite their stochastic transformer-based architecture, these models often exhibit deterministic responses when prompted for random numerical outputs. In particular, we find significant differences when changing the model, as well as the prompt language, and we attribute this phenomenon to biases deeply embedded within the training data. Models such as DeepSeek-R1 can shed some light on the internal reasoning process of LLMs, despite arriving at similar results. These biases induce predictable patterns that undermine genuine randomness, as LLMs do nothing but reproduce our own human cognitive biases.

Paper Structure

This paper contains 10 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Heatmaps for the 1--5 range configuration in the six tested models, showing the distribution of the generated random numbers (X axis) for a Spanish prompt, depending on the temperature of the model (Y axis). The color bar is set between 0 and 100 in every case.
  • Figure 2: Heatmaps for the 1--5 range configuration showing the distribution of the generated random numbers (X axis) for different languages in the Gemini 2.0 model, depending on the temperature of the model (Y axis). The color bar is set between 0 and 100 in every case.
  • Figure 3: Distribution of numbers in the 1--5 range with Python's random.randint() function and the best-ranked LLM according to its p-value, Llama 3.1-8b with $T=0.1$ in Spanish. A uniform distribution within the range is superimposed in red.
  • Figure 4: Distribution of the computed randomness index (see Eq. \ref{eq:ri}) for the 1--5 range. The blue distribution is the one obtained from LLMs, and the yellow distribution is the Python randint sampling. Vertical dashed lines mark their respective median values.
  • Figure 5: Distribution of generated random numbers in the 1--10 range for four different languages (rows) and extreme temperatures (columns). Each plot shows the six tested LLMs in the Y axis. The color bar is set between 0 and 100 in every case.
  • ...and 7 more figures