Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

Sixiang Ye, Zeyu Sun, Guoqing Wang, Liwei Guo, Qingyuan Liang, Zheng Li, Yong Liu

TL;DR

Prochemy addresses the critical dependency of code-generation quality on prompt design by introducing an automated, execution-driven prompt refinement framework. It automates training-set generation from existing data and mutated data, then iteratively mutates, evaluates, and selects prompts to converge on a fixed final prompt, with plug-and-play compatibility for existing CoT and multi-agent workflows. Across natural-language-based code generation and code translation, Prochemy yields consistent gains, achieving state-of-the-art results on HumanEval with LDB integration and notable improvements on LiveCodeBench and CodeNet/AVATAR datasets. The framework demonstrates robustness across diverse models, maintains competitive token/time overhead, and offers interpretable prompt-optimization patterns, making it a practical, scalable approach for improving LLM-driven software development tasks.
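The mutate, evaluate, and select loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine_prompt`, `toy_mutate`, `toy_score`, and `HINTS`, the population sizes, and the stopping rule are all hypothetical. In an execution-driven setup the scoring function would run generated code against training-set tests and the mutation step would ask an LLM to rewrite the prompt; here both are replaced by toy stand-ins so the sketch runs on its own.

```python
import random
from typing import Callable, Tuple

# Hypothetical instruction fragments a mutation step might add to a prompt.
HINTS = ["Handle edge cases.", "Think step by step.", "Return only code."]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Stand-in for an LLM-driven rewrite: append one random hint."""
    return prompt + " " + rng.choice(HINTS)


def toy_score(prompt: str) -> float:
    """Stand-in for execution-based evaluation: fraction of hints present.

    A real scorer would generate code with the prompt and run test cases.
    """
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


def refine_prompt(
    seed: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    children_per_parent: int = 4,  # assumed value, not from the paper
    keep: int = 2,                 # assumed value, not from the paper
    max_rounds: int = 5,           # assumed value, not from the paper
) -> Tuple[str, float]:
    """Iteratively mutate, evaluate, and select; return one fixed prompt."""
    population = [seed]
    best_prompt, best_score = seed, score(seed)
    for _ in range(max_rounds):
        # Mutate: every surviving prompt produces several variants.
        children = [mutate(p) for p in population for _ in range(children_per_parent)]
        # Evaluate and select: keep the highest-scoring candidates.
        ranked = sorted(population + children, key=score, reverse=True)
        population = ranked[:keep]
        top_score = score(population[0])
        if top_score <= best_score:
            break  # no improvement this round, treat as converged
        best_prompt, best_score = population[0], top_score
    return best_prompt, best_score


SEED = "Write a Python function that solves the task."
rng = random.Random(0)
final_prompt, final_score = refine_prompt(SEED, lambda p: toy_mutate(p, rng), toy_score)
print(final_prompt)
print(final_score)
```

The returned prompt is then reused unchanged for every task at inference time, which is what makes the final prompt "fixed" and lets it be dropped into an existing CoT or multi-agent pipeline.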

Abstract

Code generation has emerged as a key task to automate software development by converting high-level descriptions into executable code. Large language models (LLMs) excel at this but depend heavily on input prompt quality. Manual prompt engineering can be time-consuming and inconsistent, limiting LLM effectiveness. This paper introduces Prochemy, an innovative method for automatically refining prompts to boost code generation. Prochemy overcomes manual prompt limitations by automating optimization, ensuring consistency during inference, and supporting multi-agent systems. It iteratively refines prompts based on model performance, using an optimized final prompt for improved consistency across tasks. We tested Prochemy on natural language-based code generation and translation tasks using three LLM series. Results indicate Prochemy enhances existing methods, improving performance by 5.0% for GPT-3.5-Turbo and 1.9% for GPT-4o over zero-shot baselines on HumanEval. When combined with the state-of-the-art LDB method, Prochemy + LDB surpasses standalone methods by 1.2-1.8%. For code translation, Prochemy boosts GPT-4o's Java-to-Python (AVATAR) performance from 74.5 to 84.1 (+12.9%) and Python-to-Java from 66.8 to 78.2 (+17.1%). Moreover, Prochemy maintains strong performance when integrated with the o1-mini model, validating its efficacy in code tasks. Designed as plug-and-play, Prochemy optimizes prompts with minimal human input, bridging the gap between simple prompts and complex frameworks.
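The code-translation percentages in the abstract are relative gains over the baseline score, not absolute point differences. A short check of that arithmetic, using only the numbers quoted above (the helper name `relative_gain` is ours, for illustration):

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage improvement of `after` relative to `before`."""
    return (after - before) / before * 100


# Java-to-Python on AVATAR with GPT-4o: 74.5 -> 84.1
java_to_python = round(relative_gain(74.5, 84.1), 1)
# Python-to-Java with GPT-4o: 66.8 -> 78.2
python_to_java = round(relative_gain(66.8, 78.2), 1)

print(java_to_python)  # 12.9
print(python_to_java)  # 17.1
```

In absolute terms the same results are gains of 9.6 and 11.4 points, respectively.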

Paper Structure

This paper contains 48 sections, 6 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of Prochemy
  • Figure 2: Case 1 for Optimized Prompt
  • Figure 3: Case 2 HumanEval/108 for Code Generation
  • Figure 4: Case 3 atcoder_ABC137_D for Code Translation