Creative and Correct: Requesting Diverse Code Solutions from AI Foundation Models

Scott Blyth; Markus Wagner; Christoph Treude

TL;DR

This study systematically investigates the trade-off between diversity and correctness in AI foundation models using experiments with HumanEval tasks, identifying combinations of parameters and strategies that strike an optimal balance between diversity and correctness.

Abstract

AI foundation models have the capability to produce a wide array of responses to a single prompt, a feature that is highly beneficial in software engineering to generate diverse code solutions. However, this advantage introduces a significant trade-off between diversity and correctness. In software engineering tasks, diversity is key to exploring design spaces and fostering creativity, but the practical value of these solutions is heavily dependent on their correctness. Our study systematically investigates this trade-off using experiments with HumanEval tasks, exploring various parameter settings and prompting strategies. We assess the diversity of code solutions using similarity metrics from the code clone community. The study identifies combinations of parameters and strategies that strike an optimal balance between diversity and correctness, situated on the Pareto front of this trade-off space. These findings offer valuable insights for software engineers on how to effectively use AI foundation models to generate code solutions that are diverse and accurate.
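The notion of a Pareto front over the two objectives can be illustrated with a minimal sketch. It assumes each approach is summarised by a mean similarity score (lower means more diverse) and a correctness rate (higher is better). The `Approach` type, the names `X1` to `X4`, and all numeric values are hypothetical and are not taken from the study.

```python
from typing import List, NamedTuple


class Approach(NamedTuple):
    name: str           # hypothetical label, not one of the paper's approaches
    similarity: float   # mean code similarity; lower means more diverse
    correctness: float  # fraction of correct solutions; higher is better


def dominates(a: Approach, b: Approach) -> bool:
    """Return True if a is no worse than b on both objectives and better on one."""
    no_worse = a.similarity <= b.similarity and a.correctness >= b.correctness
    strictly_better = a.similarity < b.similarity or a.correctness > b.correctness
    return no_worse and strictly_better


def pareto_front(approaches: List[Approach]) -> List[Approach]:
    """Keep the approaches that no other approach dominates, sorted by similarity."""
    front = [a for a in approaches
             if not any(dominates(b, a) for b in approaches)]
    return sorted(front, key=lambda a: a.similarity)


# Hypothetical values, for illustration only.
candidates = [
    Approach("X1", 0.30, 0.50),
    Approach("X2", 0.45, 0.70),
    Approach("X3", 0.50, 0.65),  # dominated by X2: more similar and less correct
    Approach("X4", 0.80, 0.90),
]
front_names = [a.name for a in pareto_front(candidates)]
print(front_names)
```

An approach on the front cannot be made more diverse without losing correctness, or more correct without losing diversity, by switching to any other candidate in the set.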

Paper Structure (20 sections, 2 equations, 2 figures, 2 tables)

Introduction
HumanEval Tasks
Diversity Assessment
Correctness Assessment
Configuring Foundation Models
temperature
top_p
frequency_penalty
presence_penalty
logit bias
Prompting Techniques
Regeneration
n_different
n_k_different
Prompt Classification
...and 5 more sections

Figures (2)

Figure 1: n_k_different prompt with Logit Bias; UML sequence diagram.
Figure 2: Code similarity and correctness for all 20 approaches from Section \ref{sec:oneatatime} and for the 2 approaches from Section \ref{sec:combination}. Mean similarity scores are reported across all 164 HumanEval tasks. The red star shows the starting point of our investigations (A0). Sub-figures (a) and (b): the purple and green triangles represent the Pareto fronts (from left to right): purple A15, A14, A1, A20; green A15, A1, A20. The light-green diamonds represent A21 and A22 from Section \ref{sec:combination}. Sub-figure (c): the correlation of the two clone detection approaches (Spearman correlation coefficient 0.993).