Code Roulette: How Prompt Variability Affects LLM Code Generation

Andrei Paleyes, Radzim Sendyka, Diana Robinson, Christian Cabrera, Neil D. Lawrence

Abstract

Code generation is one of the most active application areas of Large Language Models (LLMs). While LLMs lower the barriers to writing code and accelerate the development process, the overall quality of generated programs depends on the quality of the given prompts. Specifically, the functionality and quality of generated code can be sensitive to the user's background and familiarity with software development. It is therefore important to quantify an LLM's sensitivity to variations in the input. To this end, we propose an evaluation pipeline for LLM code generation that focuses on measuring sensitivity to prompt augmentations. The pipeline is completely agnostic to specific programming tasks and LLMs, and is thus widely applicable. We provide extensive experimental evidence illustrating the utility of our method and share our code for the benefit of the community.
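To make the idea of measuring sensitivity to prompt augmentations concrete, the following is a minimal sketch and not the authors' implementation. It perturbs a prompt with keyboard-style typos at several rates, generates code for each variant, and compares it with the code generated from the unaltered prompt. Everything named here is an assumption made for illustration: the `NEIGHBOURS` key map, the `keyboard_typos` and `sensitivity_curve` helpers, the `toy_generate` stand-in for an LLM call, and the use of `difflib.SequenceMatcher` in place of the code-similarity metrics used in the paper.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable

# Hypothetical map from a key to some of its keyboard neighbours (illustrative only).
NEIGHBOURS = {"a": "qsz", "e": "wrd", "i": "uok", "o": "ipl",
              "s": "adw", "t": "ryg", "n": "bm"}


def keyboard_typos(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace each mapped character with a neighbouring key with probability `rate`."""
    chars = []
    for ch in prompt:
        if ch in NEIGHBOURS and rng.random() < rate:
            chars.append(rng.choice(NEIGHBOURS[ch]))
        else:
            chars.append(ch)
    return "".join(chars)


def similarity(code_a: str, code_b: str) -> float:
    """Stand-in similarity in [0, 1]; a real pipeline would use a code-aware metric."""
    return SequenceMatcher(None, code_a, code_b).ratio()


def sensitivity_curve(prompt: str,
                      generate: Callable[[str], str],
                      rates: Iterable[float],
                      seed: int = 0) -> Dict[float, float]:
    """Similarity between code from the original prompt and from each augmented prompt."""
    rng = random.Random(seed)
    reference = generate(prompt)
    return {rate: similarity(reference, generate(keyboard_typos(prompt, rate, rng)))
            for rate in rates}


def toy_generate(prompt: str) -> str:
    """Placeholder for an LLM call: echoes the prompt into a code stub."""
    return f"def solve():\n    # {prompt}\n    pass\n"


curve = sensitivity_curve("sort a list of integers", toy_generate, [0.0, 0.5, 1.0])
```

Because `generate` is just a callable, the same loop can wrap any model or task, which mirrors the task- and model-agnostic design described in the abstract. A curve that drops quickly as the rate grows would indicate high sensitivity to that augmentation.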

Paper Structure

This paper contains 17 sections, 6 figures, 1 algorithm.

Figures (6)

  • Figure 1: Diversity (i.e., SacreBLEU) versus semantic similarity (i.e., BERTScore) of the generated paraphrases against the original prompt for Our Dataset. The generated paraphrases are semantically similar to the original while textually diverse.
  • Figure 2: Comparison of BERT and TSED metrics on an example task. While the overall trend is comparable, TSED uses most of its possible range, whereas BERT only varies between 0.96 and 1.0.
  • Figure 3: Overall evaluation results. Solid lines represent mean values and shaded regions the 95% intervals, calculated from the set of approximately 3400 observations for each rate step. We can confirm that all models exhibit similar sensitivity to prompt augmentations, with Keyboard Typos being a more invasive augmentation method. Gemini 2.0 Flash is the most robust to synonym augmentation, while the sensitivity of all models to typos is approximately the same.
  • Figure 4: Results of evaluating the LLMs with paraphrasing augmentation. The x-axis shows four levels of paraphrasing: original (unaltered), as well as low (0.5 - 1.0 BLEU distance), medium (0.2 - 0.5), and high (0.0 - 0.2) from the original. Paraphrasing augmentation exhibits a trend similar to that of synonyms: a noticeable drop followed by a slow, gradual decrease in similarity.
  • Figure 5: Evaluation results for the three datasets used in this study. Solid lines represent mean values and shaded regions the 95% intervals. LLMs show the lowest sensitivity to modifications in tasks from the LeetCode (Old) dataset, and the highest sensitivity to tasks from the dataset we created from scratch.
  • ...and 1 more figure
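
The paraphrasing levels in the Figure 4 caption can be read as a simple bucketing rule over the BLEU score between a paraphrase and the original prompt. The sketch below is illustrative only: the function name `paraphrase_level` is hypothetical, scores are assumed to be already scaled to the range 0 to 1, and the treatment of the boundary values 0.2 and 0.5 is an assumption, since the caption does not say which bucket they belong to.

```python
def paraphrase_level(bleu: float) -> str:
    """Map a BLEU score (assumed to lie in [0, 1]) to a paraphrasing level.

    Thresholds follow the Figure 4 caption: a high BLEU score means the
    paraphrase stays textually close to the original, so it counts as a
    "low" level of paraphrasing.
    """
    if not 0.0 <= bleu <= 1.0:
        raise ValueError("BLEU score is expected to lie in [0, 1]")
    if bleu >= 0.5:   # assumption: 0.5 itself falls into the "low" bucket
        return "low"
    if bleu >= 0.2:   # assumption: 0.2 itself falls into the "medium" bucket
        return "medium"
    return "high"


# Example: group hypothetical paraphrase scores by level.
scores = [0.82, 0.35, 0.07]
levels = [paraphrase_level(s) for s in scores]
```

The unaltered prompt is not scored this way; it forms the separate "original" level on the x-axis.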