An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models

Shlok Tomar, Aryan Deshwal, Ethan Villalovoz, Mattia Fazzini, Haipeng Cai, Janardhan Rao Doppa

TL;DR

The paper tackles the problem of generating functionally correct code with LLMs by automating prompt optimization. It introduces BODE-GEN, a Bayesian optimization framework that searches prompt space in a continuous embedding domain via an auxiliary LLM to produce discrete prompts for a base LLM, using random projections and dimensionality-scaled priors to handle high-dimensional embeddings. Through experiments on HumanEval+ across multiple base LLMs, BODE-GEN demonstrates significant gains in code correctness and sample efficiency compared to fixed prompts, CoT, and OPRO. The work highlights concrete prompt-design patterns that correlate with correct code and suggests directions for reducing reliance on test cases while improving prompt optimization for code synthesis.

Abstract

We consider the task of generating functionally correct code using large language models (LLMs). The correctness of generated code is influenced by the prompt used to query the given base LLM. We formulate the problem of finding the appropriate prompt as a combinatorial search process and propose a Bayesian optimization (BO) approach referred to as BO for Code GENeration (BODE-GEN). BODE-GEN performs an adaptive, data-driven search over prompts, guided by training data in the form of the prompts tried and the functional accuracy of the generated code over a set of given test cases. The key insight is to perform BO in a continuous embedding space, using an auxiliary LLM to bridge the gap between the discrete prompt space and the continuous embedding space. We leverage two synergistic ideas, namely random projections and dimensionality-scaled priors, to build effective Gaussian-process-based surrogate models over the high-dimensional embedding space. Our experiments on the HumanEval+ benchmark using multiple base LLMs show that BODE-GEN can improve code generation accuracy compared to fixed prompts and manual prompt engineering. Additionally, we demonstrate that BODE-GEN is sample-efficient, requiring relatively few BO iterations to improve code accuracy.
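To make the random-projection idea concrete, the following is a minimal sketch of one common way such projections are used in high-dimensional BO: the optimizer searches a small space, and a fixed Gaussian random matrix lifts each search point into the large embedding space. This is an illustrative assumption about the setup, not the paper's implementation; the names `make_projection` and `lift`, the dimensions, and the `1/sqrt(low_dim)` scaling are all hypothetical choices.

```python
import math
import random


def make_projection(high_dim, low_dim, seed=0):
    """Build a fixed high_dim x low_dim Gaussian random matrix (list of rows).

    Hypothetical helper: entries are N(0, 1) scaled by 1/sqrt(low_dim),
    one conventional choice among several.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(low_dim)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(low_dim)]
        for _ in range(high_dim)
    ]


def lift(projection, z):
    """Map a low-dimensional search point z to a high-dimensional embedding."""
    if any(len(row) != len(z) for row in projection):
        raise ValueError("z must have one entry per projection column")
    return [sum(a * zj for a, zj in zip(row, z)) for row in projection]


# Example: BO would propose points in 4 dimensions; the auxiliary LLM would
# receive the lifted 16-dimensional vector (both sizes are made up here).
projection = make_projection(high_dim=16, low_dim=4, seed=0)
embedding = lift(projection, [0.1, -0.2, 0.3, 0.0])
print(len(embedding))
```

Because the matrix is drawn once and then held fixed, the surrogate model only has to fit a function over the low-dimensional search space, which is the reason this kind of projection helps with high-dimensional embeddings.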

Paper Structure

This paper contains 16 sections, 2 equations, 15 figures, 1 algorithm.

Figures (15)

  • Figure 1: High-level overview of our BODE-GEN approach. The method begins with a set of candidate embeddings $E_c = \{e_1, e_2, \cdots, e_m\}$ proposed by the Bayesian optimization algorithm. These embeddings, combined with the initial prompt embeddings $E_{p_0}$ and fixed instruction embeddings $E_I$, are passed to the auxiliary LLM, which generates an interpretable text prompt $p_c$. The base LLM is then queried with $p_c$ to generate code, which is evaluated for functional correctness by executing it on a set of developer-provided test cases. The percentage of test cases passed by the code is used as the objective function value for the BO procedure. This procedure is repeated for a fixed number of iterations or until we find a prompt whose generated code maximizes accuracy on the given test cases.
  • Figure 3: BODE-GEN's performance with ChatGPT 3.5 (Turbo) as the base LLM, as a function of the number of iterations (number of base LLM calls with different prompts), on a subset of representative tasks from the HumanEval+ benchmark. The objective value for BO is the percentage of test cases passed by the generated code for a given coding task. As shown in the figure, prompts suggested by BODE-GEN often reach 100% code generation correctness. Each BO iteration evaluates one new prompt; precisely, it issues three queries to the base LLM, since we generate three samples per prompt to compute pass@1.
  • Figure 4: Results showing BODE-GEN's performance on CodeLlama-7b as base-LLM as a function of number of iterations (number of base LLM calls with different prompts) on a subset of representative tasks from HumanEval+ benchmark.
  • Figure 5: Results showing BODE-GEN's performance on DeepSeek-Coder as base-LLM as a function of number of iterations (number of base LLM calls with different prompts) on a subset of representative tasks from HumanEval+ benchmark.
  • Figure 6: Overall performance of BODE-GEN compared with zero-shot CoT and OPRO, with ChatGPT 3.5 (Turbo) as the base LLM, on HumanEval+ tasks grouped by difficulty, measured as the correctness of the code generated by the initial prompt given for each task. The easy class contains all tasks whose initial prompt yields correctness above 67%, the hard class those below 30%, and the medium class those between 30% and 67%.
  • ...and 10 more figures
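
The captions above describe two small computations: the BO objective (the share of test cases passed, with three samples generated per prompt) and the difficulty grouping used in Figure 6. The sketch below illustrates both under stated assumptions. The function names are hypothetical, averaging the three samples is one plausible reading of the captions rather than a confirmed detail, and assigning the exact boundary values 30% and 67% to the medium class is likewise an assumption.

```python
def fraction_passed(test_results):
    """Share of test cases passed by one generated program.

    test_results is a list of booleans, one per test case.
    """
    if not test_results:
        raise ValueError("at least one test case is required")
    return sum(test_results) / len(test_results)


def prompt_score(samples):
    """Mean share of tests passed across the programs sampled for one prompt.

    Assumption: samples are aggregated by a plain average.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(fraction_passed(s) for s in samples) / len(samples)


def difficulty_class(initial_score):
    """Bucket a task by the correctness of its initial prompt (0.0 to 1.0).

    Thresholds follow the Figure 6 caption: above 67% is easy, below 30%
    is hard, and everything in between is medium.
    """
    if initial_score > 0.67:
        return "easy"
    if initial_score < 0.30:
        return "hard"
    return "medium"


# Three sampled programs for one prompt, two test cases each.
score = prompt_score([[True, True], [True, False], [False, False]])
print(score, difficulty_class(score))
```

In a BO loop, `prompt_score` would play the role of the objective value returned for each candidate prompt, while `difficulty_class` is only used to group tasks when reporting results.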