TAPO: Task-Referenced Adaptation for Prompt Optimization
Wenxin Luo, Weirui Wang, Xiaopeng Li, Weibo Zhou, Pengyue Jia, Xiangyu Zhao
TL;DR
The paper tackles the inefficiency and lack of task specificity in automated prompt optimization by introducing TAPO, a multitask-aware framework that dynamically selects task-relevant metrics and evaluates prompts with a multi-metric score. TAPO combines three components: a task-driven metric selection module, a metric fusion evaluator with dynamic weights, and an evolution-based prompt optimizer that uses mutation and tournament selection to progressively improve prompts. The core mechanism is the multi-objective score $S(\mathcal{P}) = \sum_{i=1}^{n} w_i \cdot M_i(\mathcal{P})$, where $\mathcal{P}$ is a candidate prompt, $M_i$ is the $i$-th selected metric (for example similarity, diversity, perplexity, or complexity), and $w_i$ is its weight. Empirical results on six public datasets across GPT-3.5-turbo, GPT-4o, and Llama3-8B-Instruct show that TAPO yields strong, task-adaptive performance and robust generalization. Ablations confirm the importance of both multi-metric evaluation and evolution-based optimization, and the authors release open-source code for replication.
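To make the score $S(\mathcal{P})$ concrete, here is a minimal sketch of the weighted sum in Python. It is not the authors' implementation: the function name `fused_score`, the metric values, and the weights are all hypothetical, and the sketch assumes each metric has already been normalized to [0, 1] with higher meaning better (so a raw perplexity would need to be inverted or rescaled first).

```python
from typing import Dict


def fused_score(metric_values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Compute S(P) = sum_i w_i * M_i(P) for one prompt.

    metric_values: metric name -> M_i(P), assumed normalized to [0, 1].
    weights:       metric name -> w_i, chosen per task (hypothetical values here).
    """
    missing = set(weights) - set(metric_values)
    if missing:
        raise KeyError(f"no metric value for: {sorted(missing)}")
    # Only metrics that carry a weight contribute to the score.
    return sum(w * metric_values[name] for name, w in weights.items())


# Illustrative numbers only; they are not taken from the paper.
example_metrics = {"similarity": 0.8, "diversity": 0.5, "perplexity": 0.6, "complexity": 0.4}
example_weights = {"similarity": 0.5, "diversity": 0.2, "perplexity": 0.2, "complexity": 0.1}

score = fused_score(example_metrics, example_weights)
print(round(score, 4))
```

Changing the weights per task is what would make the same set of metrics favor different prompts, which is the role the TL;DR assigns to the dynamic weights.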
Abstract
Prompt engineering can significantly improve the performance of large language models (LLMs), and automated prompt optimization (APO) has attracted considerable attention because manual prompt design is time-consuming and laborious. However, much of the existing work in APO overlooks task-specific characteristics, producing prompts that lack domain specificity and are poorly suited to individual tasks. In this paper, we introduce TAPO, a multitask-aware prompt optimization framework composed of three key modules. First, a task-aware metric selection module enhances task-specific prompt generation. Second, a multi-metric evaluation module jointly evaluates prompts from multiple perspectives. Third, an evolution-based optimization framework automatically refines prompts, improving adaptability across tasks. Extensive experiments on six datasets demonstrate the effectiveness of our approach, and our code is publicly available.
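The third module can be illustrated with a small evolutionary loop that uses mutation and tournament selection. This is a sketch under stated assumptions rather than the TAPO algorithm itself: `toy_score` stands in for the multi-metric evaluator, `toy_mutate` stands in for whatever rewriting step the framework uses, and the replace-the-worst update rule, the phrase list, and all parameter values are hypothetical choices made only to keep the example self-contained.

```python
import random
from typing import Callable, List

# Hypothetical phrases a mutation may append; a real system would rewrite prompts differently.
PHRASES = ["Think step by step.", "Answer concisely.", "Show your reasoning."]


def toy_score(prompt: str) -> int:
    """Placeholder evaluator: counts how many target keywords the prompt contains."""
    return sum(kw in prompt.lower() for kw in ("step", "reasoning", "concisely"))


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Placeholder mutation: append one randomly chosen phrase."""
    return prompt + " " + rng.choice(PHRASES)


def tournament_select(population: List[str], score: Callable[[str], float],
                      k: int, rng: random.Random) -> str:
    """Sample k distinct candidates and return the highest-scoring one."""
    contenders = rng.sample(population, min(k, len(population)))
    return max(contenders, key=score)


def evolve(seed_prompts: List[str], score: Callable[[str], float],
           mutate: Callable[[str, random.Random], str],
           generations: int = 10, k: int = 2,
           rng: random.Random = None) -> str:
    """Mutate tournament winners and keep a child only if it beats the current worst."""
    rng = rng or random.Random(0)
    population = list(seed_prompts)
    for _ in range(generations):
        parent = tournament_select(population, score, k, rng)
        child = mutate(parent, rng)
        worst = min(population, key=score)
        if score(child) > score(worst):
            population[population.index(worst)] = child
    return max(population, key=score)


seeds = ["Solve the problem.", "Answer the question."]
best = evolve(seeds, toy_score, toy_mutate, rng=random.Random(0))
print(best, toy_score(best))
```

Because a child replaces the worst candidate only when it scores strictly higher, the best score in the population never decreases across generations in this sketch.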
