PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback

Yun Peng, Akhilesh Deepak Gotmare, Michael Lyu, Caiming Xiong, Silvio Savarese, Doyen Sahoo

TL;DR

PerfCodeGen is a training-free framework that uses execution feedback to improve the runtime efficiency of LLM-generated code. It works in two phases: a correctness phase that refines the code using unit-test feedback, followed by a performance phase driven by the most time-consuming unit test. The performance phase applies a single greedy refinement and falls back to the fastest correct seed if that refinement is not correct. Across HumanEval, MBPP, and APPS, and on both open and closed LLMs, PerfCodeGen improves correctness and runtime efficiency, with open models approaching GPT-4 in some cases. In practice, this makes faster, high-quality code generation possible with open LLMs. The paper also outlines limitations related to measurement, scalability, and multi-objective optimization.
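
The control flow of the performance phase can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `passes_all`, `total_runtime`, `select_final`, and `refine_for_speed` are hypothetical, unit tests are assumed to be `(arguments, expected value)` pairs, and the LLM call is replaced by a stub function so the example runs on its own.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Assumption: a unit test is (positional arguments, expected return value).
UnitTest = Tuple[tuple, object]
Solution = Callable[..., object]


def passes_all(solution: Solution, tests: Sequence[UnitTest]) -> bool:
    """Return True only if every test passes; any exception counts as a failure."""
    try:
        return all(solution(*args) == expected for args, expected in tests)
    except Exception:
        return False


def total_runtime(solution: Solution, tests: Sequence[UnitTest], repeats: int = 5) -> float:
    """Best-of-N wall-clock time for one pass over the whole test suite."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for args, _expected in tests:
            solution(*args)
        best = min(best, time.perf_counter() - start)
    return best


def select_final(
    seeds: Sequence[Solution],
    refine_for_speed: Callable[[Solution], Solution],
    tests: Sequence[UnitTest],
) -> Optional[Solution]:
    """Refine the fastest correct seed once; keep the result only if it is correct."""
    correct = [seed for seed in seeds if passes_all(seed, tests)]
    if not correct:
        return None  # nothing survived the correctness phase
    fastest_seed = min(correct, key=lambda seed: total_runtime(seed, tests))
    candidate = refine_for_speed(fastest_seed)  # hypothetical stand-in for the LLM call
    if passes_all(candidate, tests):
        return candidate
    return fastest_seed  # fallback: the refinement broke correctness


# Toy usage with hand-written "seeds" in place of LLM samples.
def wrong_sum(n):
    return n

def slow_sum(n):
    return sum(range(n + 1))

def fast_sum(n):
    return n * (n + 1) // 2

TESTS = [((10,), 55), ((100,), 5050)]
chosen = select_final([wrong_sum, slow_sum], lambda seed: fast_sum, TESTS)
# chosen is fast_sum here, because the stubbed refinement passes every test.
```

In this sketch the refinement is accepted whenever it is correct, which mirrors the description above; a stricter variant could also require it to be measurably faster than the seed.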

Abstract

Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, require LLM-generated code to be not only correct but also optimally efficient. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code by incorporating feedback based on runtime during test case execution into the self-refinement iterations. With PerfCodeGen, we achieve speedups for a significantly higher proportion of problems compared to using the base LLM with sophisticated prompting techniques. Applied to open language models like Phi-3-mini, PerfCodeGen achieves runtime efficiency comparable to prompting powerful closed models like GPT-4. We achieve state-of-the-art runtime efficiency on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the ground truth reference solutions with PerfCodeGen using GPT-3.5 and GPT-4. Additionally, we demonstrate the effectiveness of our approach in enhancing code quality across a range of open LLMs of varying sizes including Phi-3-mini, Llama 3 8B, Mixtral 8x7B, Command R, and Llama 3 70B.

Paper Structure

This paper contains 20 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: PerfCodeGen. Given a problem description ($1$), we prompt the LLM to generate a candidate solution ($2$), which is passed to an execution environment to collect feedback on correctness ($3$). The LLM is then prompted to self-reflect on this feedback in a planning stage and to generate a refinement using this context ($4$). This process is iterated until the code is correct ($4$, $2$, $3$). Correct code obtained from this phase is self-refined for runtime efficiency ($7$) and then executed in the environment ($5$); the most time-consuming unit test(s) are identified and passed as performance feedback to the LLM ($6$), which acts on it by generating a self-refinement that makes the correct solution more efficient ($7$). This refinement is tested for correctness ($2$, $3$) and returned as the final solution to the problem ($8$) if correct; otherwise we fall back to the correct program from ($3$), if any.
  • Figure 2: Base prompt for HumanEval and other benchmarks.
  • Figure 3: The two correctness prompts discussed in the paper. PerfCodeGen uses the reflection and test case feedback prompt in (a).
  • Figure 4: The single-round runtime performance improving prompts.
  • Figure 5: The multi-round runtime performance improving prompts.
  • ...and 5 more figures
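
Steps ($5$) and ($6$) in the Figure 1 caption, timing each unit test and reporting the slowest one back to the LLM, could look roughly like the Python sketch below. The helper names `slowest_test` and `format_feedback`, the `(arguments, expected value)` test format, and the wording of the feedback message are all assumptions made for illustration; the paper's actual prompts are the ones shown in its Figures 4 and 5.

```python
import time
from typing import Callable, Sequence, Tuple

# Assumption: a unit test is (positional arguments, expected return value).
UnitTest = Tuple[tuple, object]


def slowest_test(
    solution: Callable[..., object],
    tests: Sequence[UnitTest],
    repeats: int = 3,
) -> Tuple[int, float]:
    """Return (index, seconds) of the unit test with the largest best-of-N runtime."""
    timings = []
    for index, (args, _expected) in enumerate(tests):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            solution(*args)
            best = min(best, time.perf_counter() - start)
        timings.append((best, index))
    seconds, index = max(timings)  # largest runtime wins
    return index, seconds


def format_feedback(tests: Sequence[UnitTest], index: int, seconds: float) -> str:
    """Build a hypothetical performance-feedback message for the LLM."""
    args, expected = tests[index]
    arg_text = ", ".join(repr(arg) for arg in args)
    return (
        f"The most time-consuming unit test is "
        f"`assert solution({arg_text}) == {expected!r}` "
        f"(about {seconds:.6f} s). Optimize the code for this case."
    )


# Toy usage: a linear-time loop whose cost grows with its input.
def count_up(n):
    total = 0
    for i in range(n):
        total += i
    return total

TESTS = [((10,), 45), ((300000,), 300000 * 299999 // 2), ((100,), 4950)]
slow_index, slow_seconds = slowest_test(count_up, TESTS)
message = format_feedback(TESTS, slow_index, slow_seconds)
```

Taking the minimum over a few repeats is one simple way to reduce timing noise; it does not remove it, which is in line with the measurement limitations mentioned in the TL;DR.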