Automated Prompt Generation for Code Intelligence: An Empirical Study and Experience in WeChat

Kexing Ji, Shiyun Fu, Cuiyun Gao, Yujia Chen, Zezhou Yang, Chaozheng Wang, Yuetang Deng

TL;DR

This work addresses the brittleness and task sensitivity of prompts used by Large Code Models (LCMs) in code intelligence. It separately examines two components of automated prompt generation (APG), Instruction Generation (IG) and Multi-Step Reasoning (MSR), and evaluates representative methods (APE and OPRO for IG; CoT, AutoCoT, and Self-Plan for MSR) across four open-source LCMs and three tasks (code translation, code summarization, and API recommendation). The authors propose APE-CoT, a joint approach that combines the best-performing IG and MSR methods. Over basic prompts, it achieves average improvements of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE-L (code summarization), and 84.53% in SuccessRate@1 (SR@1, API recommendation); on the proprietary WeChat-Bench dataset, it achieves an average MRR improvement of 148.89% for API recommendation. The study offers practical guidance for developers and researchers on automating prompt design and validates the approach in a real-world enterprise setting. Overall, combining IG and MSR via APE-CoT offers a practical path to more reliable and effective prompting of code-centric LCMs.
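For readers unfamiliar with the API recommendation metrics named above, the following is a minimal sketch of how SR@1 and MRR are commonly computed. It follows the standard definitions (top-k hit rate and mean reciprocal rank); the function names and the toy data are assumptions for illustration and are not taken from the paper's evaluation code.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Return 1/rank of the first correct API in the ranked list, or 0.0 if none is correct."""
    for rank, api in enumerate(ranked, start=1):
        if api in gold:
            return 1.0 / rank
    return 0.0


def success_rate_at_k(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one correct API among the top-k predictions."""
    hits = sum(
        1
        for ranked, gold in zip(predictions, references)
        if any(api in gold for api in ranked[:k])
    )
    return hits / len(predictions)


def mean_reciprocal_rank(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Set[str]],
) -> float:
    """Average of the reciprocal ranks over all queries."""
    total = sum(reciprocal_rank(r, g) for r, g in zip(predictions, references))
    return total / len(predictions)


# Hypothetical toy data: three queries, each with a ranked list of recommended APIs.
predictions = [["a", "b"], ["x", "y"], ["m", "n"]]
references = [{"a"}, {"y"}, {"z"}]

sr1 = success_rate_at_k(predictions, references, k=1)
mrr = mean_reciprocal_rank(predictions, references)
```

In this toy example only the first query is correct at rank 1, the second is correct at rank 2, and the third has no correct recommendation, so SR@1 rewards one query out of three while MRR also gives partial credit to the second.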

Abstract

Large Code Models (LCMs) show potential in code intelligence, but their effectiveness is greatly influenced by prompt quality. Current prompt design is mostly manual, which is time-consuming and highly dependent on specific LCMs and tasks. While automated prompt generation (APG) exists in NLP, it is underexplored for code intelligence. This creates a gap, as automating the prompt process is essential for developers facing diverse tasks and black-box LCMs. To address this gap, we empirically investigate two important parts of APG: Instruction Generation (IG) and Multi-Step Reasoning (MSR). IG provides a task-related description to instruct LCMs, while MSR guides them to produce logical steps before the final answer. We evaluate widely used APG methods for each part on four open-source LCMs and three code intelligence tasks: code translation (PL-PL), code summarization (PL-NL), and API recommendation (NL-PL). Experimental results indicate that both IG and MSR dramatically enhance performance compared to basic prompts. Based on these results, we propose a novel APG approach combining the best methods of the two parts. Experiments show our approach achieves average improvements of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE-L (code summarization), and 84.53% in SuccessRate@1 (API recommendation) over basic prompts. To validate its effectiveness in an industrial scenario, we evaluate our approach on WeChat-Bench, a proprietary dataset, achieving an average MRR improvement of 148.89% for API recommendation.
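To make the two roles concrete, here is a minimal sketch of how an instruction (the IG part) and a reasoning cue (the MSR part) might be assembled around a task input. The `build_prompt` helper, the instruction text, and the layout are assumptions for illustration only; the paper's actual APE-CoT prompt (Figure 5) is not reproduced here. The trigger phrase is the widely used zero-shot chain-of-thought cue.

```python
def build_prompt(task_input: str, instruction: str = "", reasoning_trigger: str = "") -> str:
    """Assemble a prompt from an optional instruction, the task input, and an optional reasoning cue."""
    parts = []
    if instruction:
        parts.append(instruction.strip())        # IG: task-related description
    parts.append(task_input.strip())             # the code or query itself
    if reasoning_trigger:
        parts.append(reasoning_trigger.strip())  # MSR: ask for steps before the answer
    return "\n\n".join(parts)


# Hypothetical code-translation input.
source_code = "def add(a, b):\n    return a + b"

# Basic prompt: the task input alone.
basic = build_prompt(source_code)

# Combined prompt: instruction plus reasoning cue (both strings are illustrative).
combined = build_prompt(
    source_code,
    instruction="Translate the following Python function into Java.",
    reasoning_trigger="Let's think step by step.",
)
```

In an automated setting, the instruction string would be produced by an IG method rather than written by hand, and the reasoning portion would come from the chosen MSR method; the helper above only shows where each piece sits in the final prompt.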

Paper Structure

This paper contains 27 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: An example of a designed prompt for an LCM to generate a summary of a given code snippet.
  • Figure 2: An overview of APG methods: Instruction Generation and Multi-Step Reasoning.
  • Figure 3: An example of instructions generated by APE and OPRO for the code translation task using Deepseek-Coder.
  • Figure 4: Results of multi-step reasoning on three code intelligence tasks. The vertical axis shows the average CV of each metric.
  • Figure 5: An example of an APE-CoT prompt for code translation.
  • ...and 1 more figure