Studying How Configurations Impact Code Generation in LLMs: the Case of ChatGPT

Benedetta Donato, Leonardo Mariani, Daniela Micucci, Oliviero Riganelli

TL;DR

This work studies how configuration choices affect ChatGPT when it generates Java method implementations. It uses a large-scale, reproducible methodology that varies the temperature ($T$) and top-p parameters and repeats each prompt, revealing that top-p exerts a stronger influence than temperature and that higher creativity can increase the range of correct or plausible outputs. The study provides practical recommendations, notably that $T=1.2$ and top-p $=0.0$, with about five repetitions, yield a favorable balance between coverage and effort, and it releases an open dataset to support replication. These insights help practitioners configure LLM-based coding tools more effectively and encourage more rigorous reporting of prompt settings in empirical evaluations.

Abstract

Leveraging LLMs for code generation is becoming increasingly common, as tools like ChatGPT can suggest method implementations with minimal input, such as a method signature and brief description. Empirical studies further highlight the effectiveness of LLMs in handling such tasks, demonstrating notable performance in code generation scenarios. However, LLMs are inherently non-deterministic, with their output influenced by parameters such as temperature, which regulates the model's level of creativity, and top-p, which controls the choice of the tokens that shall appear in the output. Despite their significance, the role of these parameters is often overlooked. This paper systematically studies the impact of these parameters, as well as the number of prompt repetitions required to account for non-determinism, in the context of 548 Java methods. We observe significantly different performances across different configurations of ChatGPT, with temperature having a marginal impact compared to the more prominent influence of the top-p parameter. Additionally, we show how creativity can enhance code generation tasks. Finally, we provide concrete recommendations for addressing the non-determinism of the model.
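
To make the two parameters concrete, the following is a minimal sketch of how temperature scaling and top-p (nucleus) filtering are commonly described for token sampling. It is an illustration under stated assumptions, not ChatGPT's actual implementation: the function names (`apply_temperature`, `top_p_filter`, `sample_token`) and the toy logits are hypothetical, and the handling of edge cases such as top-p $=0.0$ inside a real service may differ.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving tokens so their probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature, top_p, rng):
    """Sample one token index after temperature scaling and top-p filtering."""
    probs = apply_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
    rng = random.Random(0)
    print(sample_token(toy_logits, temperature=1.2, top_p=0.0, rng=rng))
    print(sample_token(toy_logits, temperature=1.2, top_p=0.95, rng=rng))
```

In this sketch, a top-p of 0.0 keeps only the single most likely token, so the temperature has no remaining effect on the choice, while a top-p close to 1.0 lets the temperature shape the draw over nearly the whole vocabulary. This is one plausible intuition for why top-p can dominate temperature, offered as an assumption rather than as the paper's explanation.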

Paper Structure

This paper contains 16 sections, 1 equation, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Dataset construction process.
  • Figure 2: Percentage of invalid, incorrect, and plausible responses for different temperature values.
  • Figure 3: Percentage of methods with at least a plausible, incorrect, or invalid implementation, considering the best result returned by a configuration across 10 repetitions.
  • Figure 4: Euler-Venn diagram with the plausible methods generated at least once for each temperature value.
  • Figure 5: Percentage of invalid, incorrect, and plausible responses for different values of top-p.
  • ...and 9 more figures