Non-Determinism of "Deterministic" LLM Settings

Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, Breck Baldwin

TL;DR

The paper systematically evaluates non-determinism in large language models whose hyper-parameters are set to maximize determinism (temperature=0), across eight tasks from BBH and MMLU. It introduces TARr@N and TARa@N to quantify the stability of raw outputs and of parsed answers, respectively, revealing 5–15% accuracy variability and substantial model- and task-specific differences. The findings show that parsed answers are more stable than raw strings but still exhibit notable instability, with output length and task type influencing reliability. The work highlights practical engineering implications for benchmarks, testing, and deployment, and proposes reporting stability metrics across multiple runs as a more robust evaluation standard.

Abstract

LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. Yet the questions of how pervasive this is, and with what impact on results, have not to our knowledge been systematically investigated. We investigate non-determinism in five LLMs configured to be deterministic when applied to eight common tasks across 10 runs, in both zero-shot and few-shot settings. We see accuracy variations up to 15% across naturally occurring runs, with a gap between best possible performance and worst possible performance of up to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy across all tasks, much less identical output strings. Sharing preliminary results with insiders has revealed that non-determinism is perhaps essential to the efficient use of compute resources via co-mingled data in input buffers, so this issue is not going away anytime soon. To better quantify our observations, we introduce metrics focused on quantifying determinism: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for the total agreement rate of parsed-out answers. Our code and data are publicly available at https://github.com/breckbaldwin/llm-stability.
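
To make the two metrics concrete, here is a minimal sketch of how a total agreement rate might be computed. It assumes that TAR@N is the fraction of items on which all N runs agree exactly; the abstract does not spell out the formula, so treat this as an illustration rather than the authors' implementation. The function name `total_agreement_rate`, the helper `last_token`, and the sample data are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def total_agreement_rate(
    runs: Sequence[Sequence[str]],
    parse: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of items for which every run produced the same value.

    runs[r][i] is the output of run r on item i. With parse=None the raw
    strings are compared (TARr-style); with a parser, the parsed answers
    are compared (TARa-style). This definition is an assumption.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one item")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every run must cover the same items")

    agreeing = 0
    for outputs in zip(*runs):  # all runs' outputs for a single item
        values = [parse(o) for o in outputs] if parse else list(outputs)
        if len(set(values)) == 1:
            agreeing += 1
    return agreeing / n_items


def last_token(output: str) -> str:
    """Hypothetical answer parser: last whitespace-separated token, cleaned."""
    return output.strip().split()[-1].strip(".()").upper()


# Made-up outputs: 3 runs over 4 multiple-choice items.
runs = [
    ["The answer is (A)", "B", "So the answer is C.", "D"],
    ["The answer is (A)", "B", "Answer: C", "A"],
    ["The answer is (A)", "B", "The answer is C", "D"],
]

tar_raw = total_agreement_rate(runs)                 # raw strings
tar_ans = total_agreement_rate(runs, parse=last_token)  # parsed answers
print(tar_raw, tar_ans)
```

In this toy data the third item is worded differently in each run but parses to the same answer, so the parsed-answer rate is higher than the raw-string rate, which matches the direction of the finding reported in the TL;DR.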

Paper Structure (17 sections, 1 equation, 12 figures, 3 tables)

This paper contains 17 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Percentage difference between maximum and minimum accuracy in 10 runs per model, for 5 models on 8 tasks with zero-shot and few-shot settings.
  • Figure 2: Accuracy over 20 identical runs on college math, temperature=0, top-p=1. Median in blue, mean in black with dashed 5% and 95% quantiles.
  • Figure 3: TARr@10 for each model in the few-shot setting. Dataset colors have been chosen to distinguish them by relatively challenging (increasingly dark red hues) versus relatively easy (increasingly dark blue hues).
  • Figure 4: TARa@10 for each task in the few-shot setting. Model colors have been chosen to distinguish them by relatively low performing (increasingly dark red hues) versus relatively high performing (increasingly dark blue hues).
  • Figure 5: Spearman correlation matrix between metrics in few-shot setting (on the left) and zero-shot setting (on the right).
  • ...and 7 more figures
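
The max-minus-min accuracy difference described in the Figure 1 caption, and the best-possible versus worst-possible gap mentioned in the abstract, can be sketched as below. This is a hedged illustration: it assumes "best possible" counts an item as correct if any run got it right and "worst possible" counts it only if every run got it right, which is an interpretation rather than a definition taken from the text above. The function name `accuracy_spread` and the sample data are hypothetical.

```python
from typing import Dict, Sequence


def accuracy_spread(correct: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Summarize run-to-run accuracy variation.

    correct[r][i] is True when run r answered item i correctly.
    """
    if not correct or not correct[0]:
        raise ValueError("need at least one run with at least one item")
    n_items = len(correct[0])
    if any(len(run) != n_items for run in correct):
        raise ValueError("every run must cover the same items")

    per_run = [sum(run) / n_items for run in correct]
    per_item = list(zip(*correct))  # all runs' results for each item

    # Assumed definitions: optimistic and pessimistic oracles over runs.
    best = sum(any(item) for item in per_item) / n_items
    worst = sum(all(item) for item in per_item) / n_items

    return {
        "min_accuracy": min(per_run),
        "max_accuracy": max(per_run),
        "spread": max(per_run) - min(per_run),
        "best_possible": best,
        "worst_possible": worst,
    }


# Made-up correctness flags: 3 runs over 4 items.
correct = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]

summary = accuracy_spread(correct)
print(summary)
```

The spread uses only the accuracies of runs that actually occurred, while the best and worst cases mix results across runs item by item, which is why the latter gap can be much wider.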