Measuring Determinism in Large Language Models for Software Code Review

Eugene Klishevich, Yegor Denisov-Blanch, Simon Obstbaum, Igor Ciobanu, Michal Kosinski

TL;DR

The paper investigates the determinism of large language models when used for software code reviews. It evaluates four state-of-the-art LLMs on 70 Java commits, using zero-temperature prompts and five repeated runs across three prompt lengths, and quantifies consistency via Pearson correlation with bootstrap-derived 95% confidence intervals. Prior ICC measurements for human reviewers serve as a baseline for comparing LLM reliability with human reliability. The findings show that LLMs remain non-deterministic despite zero temperature, with variability differing by model, underscoring the need for cautious deployment and strategies to mitigate reliability issues. The work highlights practical implications, such as ensemble approaches and openness of models, and outlines limitations and directions for future research to improve stability in automated code-review assistance.
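
To make the consistency metric concrete, the following is a minimal sketch of how Pearson correlation between repeated runs and a percentile-bootstrap 95% confidence interval could be computed. The pairing scheme (mean over all pairs of runs), the choice to resample commits, and the function names `pearson`, `mean_pairwise_r`, and `bootstrap_ci` are assumptions made for illustration; the paper's exact procedure may differ.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def mean_pairwise_r(runs):
    """Average Pearson r over all pairs of runs.

    runs[i][j] is the score given to commit j in repeated run i
    (assumed layout, for illustration only).
    """
    pairs = list(combinations(runs, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def bootstrap_ci(runs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling commits with replacement."""
    rng = random.Random(seed)
    n_commits = len(runs[0])
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_commits) for _ in range(n_commits)]
        resampled = [[run[i] for i in idx] for run in runs]
        try:
            stats.append(mean_pairwise_r(resampled))
        except ValueError:
            continue  # skip degenerate resamples with zero variance
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    m = len(stats)
    lo = stats[int((alpha / 2) * m)]
    hi = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lo, hi


if __name__ == "__main__":
    # Made-up scores: 3 repeated runs over 6 commits (not data from the paper).
    example_runs = [
        [7, 4, 9, 5, 6, 8],
        [7, 5, 9, 4, 6, 8],
        [6, 4, 8, 5, 7, 8],
    ]
    print(mean_pairwise_r(example_runs), bootstrap_ci(example_runs))
```

A fully deterministic model would give identical runs and a correlation of 1; values below 1 indicate run-to-run disagreement on the same inputs.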

Abstract

Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. In this study, we tested four leading LLMs -- GPT-4o mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 90B Vision -- on 70 Java commits from both private and public repositories. By setting each model's temperature to zero, clearing context, and repeating the exact same prompts five times, we measured how consistently each model generated code-review assessments. Our results reveal that even with temperature minimized, LLM responses varied, and the degree of variation differed by model. These findings highlight the inherently limited consistency (test-retest reliability) of LLMs -- even when the temperature is set to zero -- and the need for caution when using LLM-generated code reviews to make real-world decisions.
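
The repeated-run protocol described in the abstract can be sketched as a small harness. This is an illustration under stated assumptions, not the authors' code: `review_fn` is a hypothetical callable standing in for a model client configured with temperature zero and no conversation history, and it is assumed to return a single numeric assessment per call.

```python
from statistics import mean, pstdev


def repeat_review(review_fn, prompt, n_runs=5):
    """Send the identical prompt n_runs times, each as an independent call.

    review_fn is hypothetical: it should wrap whatever model client is
    used, with temperature set to zero and no shared context between calls.
    """
    return [review_fn(prompt) for _ in range(n_runs)]


def spread(scores):
    """Summarize run-to-run variation for one commit's repeated scores."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "range": max(scores) - min(scores),
    }


if __name__ == "__main__":
    # Stub that always answers the same way, standing in for a real model.
    always_seven = lambda prompt: 7
    scores = repeat_review(always_seven, "Review this commit: ...")
    print(spread(scores))  # a deterministic reviewer shows zero spread
```

Any non-zero range or standard deviation for a fixed prompt is direct evidence of non-determinism for that commit.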

Paper Structure

This paper contains 36 sections, 4 tables.