Automatic Prompt Selection for Large Language Models
Viet-Tung Do, Van-Khanh Hoang, Duy-Hung Nguyen, Shahab Sabahi, Jeff Yang, Hajime Hotta, Minh-Tien Nguyen, Hung Le
TL;DR
This work addresses the challenge of designing effective prompts for large language models by proposing Automatic Prompt Selection (APS), a discrete, three-step pipeline that (i) builds a diverse prompt database through cluster-based generation, (ii) trains a lightweight prompt evaluator to rank prompts without further LLM calls, and (iii) selects the top-ranked prompts (with optional voting) to solve zero-shot QA tasks. By combining cluster-level prompt generation with a trainable evaluator, APS achieves competitive accuracy on GSM8K, MultiArith, and AQuA while reducing the computational burden relative to querying the LLM once per candidate prompt. The approach improves over strong baselines and existing automatic prompting methods, demonstrating that ranking prompts from a fixed database is a practical route to robust, scalable prompt optimization. Limitations include potential prompt duplication and tuning overhead. Future work aims to extend APS to few-shot in-context learning and broader NLP tasks, and to consider the ethical implications of automated prompt engineering.
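The selection step with optional voting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `select_top_prompts` and `answer_with_voting`, the `solve` callable (standing in for an LLM call), and the assumption that the evaluator's scores are already available as a list of floats are all hypothetical and introduced here only for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def select_top_prompts(prompts: Sequence[str],
                       scores: Sequence[float],
                       k: int = 3) -> List[str]:
    """Return the k prompts with the highest evaluator scores."""
    ranked = sorted(zip(scores, prompts), key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in ranked[:k]]


def answer_with_voting(question: str,
                       prompts: Sequence[str],
                       scores: Sequence[float],
                       solve: Callable[[str, str], str],
                       k: int = 3) -> str:
    """Query the solver once per top-k prompt and return the majority answer.

    `solve(prompt, question)` is a hypothetical stand-in for the LLM call.
    With k=1 this reduces to using only the single best-ranked prompt.
    """
    top_prompts = select_top_prompts(prompts, scores, k)
    answers = [solve(prompt, question) for prompt in top_prompts]
    return Counter(answers).most_common(1)[0][0]
```

Under these assumptions, the LLM is called only k times per question rather than once for every prompt in the database, which is where the cost saving of ranking before querying would come from.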
Abstract
Large Language Models (LLMs) can perform various natural language processing tasks with suitable instruction prompts. However, designing effective prompts manually is challenging and time-consuming. Existing methods for automatic prompt optimization lack either flexibility or efficiency. In this paper, we propose an effective approach to automatically select the optimal prompt for a given input from a finite set of synthetic candidate prompts. Our approach consists of three steps: (1) clustering the training data and generating candidate prompts for each cluster using an LLM-based prompt generator; (2) synthesizing a dataset of input-prompt-output tuples for training a prompt evaluator to rank the prompts based on their relevance to the input; (3) using the prompt evaluator to select the best prompt for a new input at test time. Our approach balances prompt generality and specificity and eliminates the need for resource-intensive training and inference. It demonstrates competitive performance on zero-shot question-answering datasets: GSM8K, MultiArith, and AQuA.
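Steps (1) and (2) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the paper's code: cluster ids are assumed to be precomputed by any off-the-shelf clustering of the training questions, `generate_prompts` and `solve` are hypothetical stand-ins for the LLM-based prompt generator and the answering LLM, and the `is_correct` flag is one assumed way to turn input-prompt-output tuples into a training signal for the evaluator.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    answer: str
    cluster: int  # assumed: precomputed cluster id for this question


@dataclass
class Record:
    question: str
    prompt: str
    output: str
    is_correct: bool  # assumed label: did this prompt yield the gold answer?


def build_prompt_database(train: List[Example],
                          generate_prompts: Callable[[List[Example]], List[str]]
                          ) -> List[str]:
    """Step 1: generate candidate prompts per cluster and drop exact duplicates."""
    by_cluster = defaultdict(list)
    for example in train:
        by_cluster[example.cluster].append(example)
    database: List[str] = []
    for cluster_id in sorted(by_cluster):
        for prompt in generate_prompts(by_cluster[cluster_id]):
            if prompt not in database:
                database.append(prompt)
    return database


def synthesize_records(train: List[Example],
                       database: List[str],
                       solve: Callable[[str, str], str]) -> List[Record]:
    """Step 2: pair every training question with every candidate prompt."""
    records = []
    for example in train:
        for prompt in database:
            output = solve(prompt, example.question)
            records.append(Record(example.question, prompt, output,
                                  output == example.answer))
    return records
```

The resulting records are what a prompt evaluator would be trained on. Step (3) then needs only the trained evaluator to score the database for a new input, with no further calls to the prompt generator.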
