Non-Determinism of "Deterministic" LLM Settings
Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, Breck Baldwin
TL;DR
The paper systematically evaluates non-determinism in large language models when hyper-parameters are set to maximize determinism (temperature=0) across eight tasks from BBH and MMLU. It introduces TARr@N and TARa@N to quantify the stability of raw outputs and parsed answers, respectively, revealing 5–15% accuracy variability and substantial model- and task-specific differences. The findings show that parsed answers are more stable than raw strings but still exhibit notable instability, with output length and task type influencing reliability. The work highlights practical engineering implications for benchmarks, testing, and deployment, and proposes reporting stability metrics across multiple runs as a more robust evaluation standard.
Abstract
LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. Yet the questions of how pervasive this is, and with what impact on results, have not to our knowledge been systematically investigated. We investigate non-determinism in five LLMs configured to be deterministic when applied to eight common tasks across 10 runs, in both zero-shot and few-shot settings. We see accuracy variations of up to 15% across naturally occurring runs, with a gap between best possible and worst possible performance of up to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy across all tasks, much less identical output strings. Sharing preliminary results with insiders has revealed that non-determinism is perhaps essential to the efficient use of compute resources via co-mingled data in input buffers, so this issue is not going away anytime soon. To better quantify our observations, we introduce metrics focused on quantifying determinism: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for the total agreement rate of parsed-out answers. Our code and data are publicly available at https://github.com/breckbaldwin/llm-stability.
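The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes that "total agreement" for a question means all N runs produced the same value, and that the rate is the fraction of questions with total agreement; the abstract does not spell out the formula, so this reading is an assumption. The function names, the toy data, and the regex-based `parse_answer` are hypothetical and are not taken from the authors' repository.

```python
import re
from typing import Callable, Hashable, Optional, Sequence


def total_agreement_rate(
    runs: Sequence[Sequence[str]],
    key: Callable[[str], Hashable] = lambda s: s,
) -> float:
    """Fraction of questions on which all N runs agree after applying `key`.

    runs[r][i] is the raw output of run r on question i.
    Assumed reading of TAR@N: a question counts only if every run matches.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run and one question")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every run must cover the same questions")

    agreed = 0
    for i in range(n_items):
        # A set collapses identical values; size 1 means total agreement.
        if len({key(run[i]) for run in runs}) == 1:
            agreed += 1
    return agreed / n_items


def parse_answer(raw: str) -> Optional[str]:
    """Hypothetical parser: the last '(A)'..'(D)' choice in the output."""
    matches = re.findall(r"\(([A-D])\)", raw)
    return matches[-1] if matches else None


# Toy data: N = 3 runs over 3 questions (illustrative, not from the paper).
runs = [
    ["The answer is (A).", "I think (B)", "(C)"],
    ["The answer is (A).", "Answer: (B)", "(D)"],
    ["The answer is (A).", "(B)", "(C)"],
]

tar_r = total_agreement_rate(runs)                    # raw strings
tar_a = total_agreement_rate(runs, key=parse_answer)  # parsed answers
print(f"TARr@3 = {tar_r:.3f}, TARa@3 = {tar_a:.3f}")
```

In this toy data the second question has three different raw strings that all parse to the same choice, so it counts toward TARa@3 but not TARr@3. Note that under this sketch, outputs that all fail to parse would count as agreeing on `None`; a real evaluation may want to treat that case differently.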
