Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?

Jean-Baptiste Döderlein, Nguessan Hermann Kouadio, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale

TL;DR

The paper systematically analyzes how prompt design and temperature settings influence code generation by Copilot, Codex, and StarCoder2 on HumanEval and LeetCode, and introduces operators that systematically vary these inputs. It demonstrates that carefully chosen prompt variations and temperatures can dramatically boost pass@k scores, with Codex able to reach near-perfect performance under certain configurations. However, the effects depend heavily on the model, dataset, and problem, so no universal recipe emerges; the results instead highlight model brittleness and the need for per-problem tuning. The replication with StarCoder2-15B confirms that prompt formulation and temperature matter beyond a single model, underscoring the importance of transparent configuration and reproducible evaluation in LLM-based code generation. Overall, the work provides both upper-bound insights and practical cautions for deploying LLMs as code assistants in real-world development workflows.

Abstract

Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, which limits their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Realizing this potential in practice is challenging because of the complex interplay between parameters: the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.

Paper Structure (49 sections, 2 equations, 4 figures, 16 tables)

This paper contains 49 sections, 2 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Code generation for the HumanEval problem choose_num with the original prompt (a) and modified prompts (b and c). The prompt is shown on a blue background and the code generated by Copilot on a green background.
  • Figure 2: Pass@k evaluation of Codex on the HumanEval dataset for selected variations; $temp \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1\}$
  • Figure 3: Pass@1 evaluation of Codex on the LeetCode dataset; $temp \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1\}$
  • Figure 4: Pass@k evaluation of StarCoder2-15B on the HumanEval dataset for selected variations; $temp \in \{0.001, 0.2, 0.4, 0.6, 0.8, 1\}$
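
The Pass@k metric named in the captions above is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with n sampled programs, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that standard estimator; this page does not show which formula the paper uses, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of programs sampled for the problem
    c: number of those programs that pass all tests
    k: number of programs the user is allowed to try
    Assumes the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail,
    # so every draw of k samples contains at least one passing program.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

Under this assumption, a sweep like the ones in Figures 2 to 4 would call `mean_pass_at_k` once per (prompt variation, temperature) pair, using the (n, c) counts collected for that configuration.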