Automatic Prompt Selection for Large Language Models

Viet-Tung Do, Van-Khanh Hoang, Duy-Hung Nguyen, Shahab Sabahi, Jeff Yang, Hajime Hotta, Minh-Tien Nguyen, Hung Le

TL;DR

This work addresses the challenge of designing effective prompts for large language models by proposing Automatic Prompt Selection (APS), a discrete, three‑step pipeline that (i) builds a diverse prompt database through cluster‑based generation, (ii) trains a lightweight prompt evaluator to rank prompts without further LLM calls, and (iii) selects the top prompts (with optional voting) to solve zero‑shot QA tasks. By combining cluster‑level prompt generation with a trainable evaluator, APS achieves competitive accuracy on GSM8K, MultiArith, and AQuA while reducing the computational burden compared to per‑prompt querying. The approach yields notable gains over strong baselines and existing automatic prompting methods, demonstrating the practicality of prompt ranking over a fixed database for robust, scalable prompt optimization. Limitations include potential prompt duplication and tuning overhead, with future work aimed at extending APS to few‑shot in‑context learning and broader NLP tasks, while considering ethical implications of automated prompt engineering.

Abstract

Large Language Models (LLMs) can perform various natural language processing tasks with suitable instruction prompts. However, designing effective prompts manually is challenging and time-consuming. Existing methods for automatic prompt optimization lack either flexibility or efficiency. In this paper, we propose an effective approach to automatically select the optimal prompt for a given input from a finite set of synthetic candidate prompts. Our approach consists of three steps: (1) clustering the training data and generating candidate prompts for each cluster using an LLM-based prompt generator; (2) synthesizing a dataset of input-prompt-output tuples for training a prompt evaluator to rank the prompts based on their relevance to the input; (3) using the prompt evaluator to select the best prompt for a new input at test time. Our approach balances prompt generality and specificity and eliminates the need for resource-intensive training and inference. It demonstrates competitive performance on zero-shot question-answering datasets: GSM8K, MultiArith, and AQuA.
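
The selection step (3), together with the optional voting mentioned in the TL;DR, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `rank_prompts`, `majority_vote`, and especially `toy_score` are hypothetical names, and `toy_score` is a crude word-overlap stand-in for the trained prompt evaluator, used here only so the example runs on its own.

```python
from collections import Counter
from typing import Callable, List, Sequence


def rank_prompts(question: str, prompts: Sequence[str],
                 score: Callable[[str, str], float], k: int = 1) -> List[str]:
    """Return the k prompts with the highest evaluator score for this question."""
    ranked = sorted(prompts, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def toy_score(question: str, prompt: str) -> float:
    """ASSUMPTION: placeholder scorer (word overlap), not the paper's evaluator."""
    q_words = set(question.lower().split())
    p_words = set(prompt.lower().split())
    return len(q_words & p_words) / (len(p_words) or 1)


prompt_db = [
    "Let's think step by step.",
    "Compute the total price step by step.",
    "Translate the sentence.",
]
question = "What is the total price of 3 apples?"
top2 = rank_prompts(question, prompt_db, toy_score, k=2)
```

In a real setup, each selected prompt would be sent with the question to the downstream LLM, and `majority_vote` would aggregate the resulting answers. Because scoring only calls the evaluator, ranking the whole database costs no additional LLM queries.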

Paper Structure

This paper contains 28 sections, 7 equations, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Automatic Prompt Selection (APS) has three steps. (1) Prompt Database Generation: We cluster the training data, use an LLM-based prompt generator $\mathtt{A}_1$ to produce diverse prompts for each cluster, and combine them into a versatile prompt database. (2) Prompt Evaluator Training: We query a data-generation LLM $\mathtt{A}_2$ with the generated prompts and training inputs to produce tuples of input ($q,c$), prompt ($p$), LLM output ($a'$), and ground-truth output ($a$). We train the prompt evaluator $\mathtt{E}$ on this dataset with a preference loss that encourages high scores for good prompts and low scores for bad ones. (3) Prompt Ranking: At inference time, given a test input, we use the prompt evaluator to pick the highest-scoring prompt from the database. The selected prompt (highlighted with a red border) is then used with a downstream LLM $\mathtt{M}$ to compute the final output $a'$.
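
The caption does not spell out the preference loss, so the following Python sketch shows one common pairwise form, $-\log \sigma(s_{\text{good}} - s_{\text{bad}})$, purely as an illustration. Two things here are assumptions rather than details from the paper: a prompt is treated as "good" when the LLM output matches the ground-truth answer, and the function names `build_pairs` and `pairwise_preference_loss` are hypothetical.

```python
import math
from typing import List, Tuple


def build_pairs(records: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Form (good_prompt, bad_prompt) pairs for a single training input.

    Each record is (prompt, llm_answer, gold_answer).
    ASSUMPTION: a prompt counts as good when llm_answer == gold_answer.
    """
    good = [p for p, predicted, gold in records if predicted == gold]
    bad = [p for p, predicted, gold in records if predicted != gold]
    return [(g, b) for g in good for b in bad]


def pairwise_preference_loss(score_good: float, score_bad: float) -> float:
    """Compute -log(sigmoid(score_good - score_bad)) in a numerically stable way.

    Uses the identity -log(sigmoid(d)) = softplus(-d)
                                       = max(-d, 0) + log1p(exp(-|d|)).
    """
    diff = score_good - score_bad
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


records = [
    ("prompt A", "18", "18"),  # LLM answer matches the gold answer
    ("prompt B", "20", "18"),  # mismatch
    ("prompt C", "18", "18"),
]
pairs = build_pairs(records)
```

The loss shrinks as the evaluator scores the good prompt further above the bad one, and it equals $\log 2$ when the two scores are tied, which is the behavior the caption describes: high scores for good prompts, low scores for bad ones.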