The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation

Shuzheng Gao; Chaozheng Wang; Cuiyun Gao; Xiaoqian Jiao; Chun Yong Chong; Shan Gao; Michael Lyu

TL;DR

MAPS addresses suboptimal LLM-driven test-case generation by automatically tailoring prompts to each model. It integrates domain contextual knowledge extraction, diversity-guided prompt generation, and failure-driven rule induction to create robust, model-specific prompts that steer LLMs toward higher-coverage test cases. Empirical results on Defects4J across ChatGPT, Llama-3.1, and Qwen2 show MAPS outperforms state-of-the-art prompt optimizers and manual prompts, with consistent gains in line and branch coverage and clear evidence of cross-model tailoring. The work demonstrates a practical, scalable pathway to leverage LLMs for software testing, with a publicly available replication package and strong potential for extension to other languages and benchmarks.

Abstract

Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, existing work primarily relies on human-written plain prompts, which often leads to suboptimal results because LLM performance is highly sensitive to the prompt. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs may be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although methods for automated prompt optimization exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance on the task.

Paper Structure (31 sections, 2 equations, 6 figures, 9 tables, 1 algorithm)

Introduction
Background and Motivating Example
Background
Motivating Examples
Proposed Approach
Overview
Domain Contextual Knowledge Extraction
Diversity-guided Prompt Generation
Failure-driven Rule Induction
Failure Information Selection
Error Reflection
Rule Validation
Experimental Setup
Research Questions
Datasets and Metrics
...and 16 more sections

Figures (6)

Figure 1: Overview of MAPS’s workflow.
Figure 2: An illustration of the format of the final prompt and the extracted context information.
Figure 3: The prompt templates of MAPS. The complete ones can be found in our replication package.
Figure 4: Parameter analysis of the number of seed prompts and generated prompts on ChatGPT.
Figure 5: Parameter analysis of the iteration number.
...and 1 more figure
