Self-Supervised Prompt Optimization

Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Xinbing Liang, Fengwei Teng, Jinhao Tu, Fashen Ren, Xiangru Tang, Sirui Hong, Chenglin Wu, Yuyu Luo

TL;DR

This work tackles the reliance on external references in prompt optimization by introducing Self-Supervised Prompt Optimization (SPO), a reference-free framework that uses pairwise output comparisons judged by an LLM to guide prompt improvement. SPO implements an Optimize-Execute-Evaluate loop where outputs serve as both evaluation references and optimization signals, achieving state-of-the-art or competitive performance on both closed benchmarks and open-ended MT-Bench tasks at a fraction of prior costs. The approach demonstrates strong cost-efficiency (as low as $0.15 per dataset, 1.1%–5.6% of competing methods) and robustness across optimization/evaluation/execution model configurations, with ablation analyses confirming the effectiveness of small sample sizes and moderate iteration counts. Limitations include potential biases from the evaluation model and a focus on single-model optimization, suggesting avenues for cross-model transfer and bias mitigation in future work.

Abstract

Well-designed prompts are crucial for enhancing large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human feedback, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external references. Motivated by the observations that prompt quality manifests directly in LLM outputs and that LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
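The Optimize-Execute-Evaluate loop described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function name `spo_loop`, its parameters, and the three toy callables are hypothetical stand-ins for the LLM optimizer, executor, and evaluator, chosen so that the example runs without any model calls.

```python
from typing import Callable, List

def spo_loop(
    initial_prompt: str,
    questions: List[str],
    optimize: Callable[[str, List[str]], str],
    execute: Callable[[str, str], str],
    evaluate: Callable[[List[str], List[str], List[str]], bool],
    iterations: int = 10,
) -> str:
    """Reference-free prompt search: keep a candidate only if its outputs
    win a pairwise comparison against the current best outputs."""
    best_prompt = initial_prompt
    best_outputs = [execute(best_prompt, q) for q in questions]
    for _ in range(iterations):
        # Optimize: propose a new prompt from the current best prompt and outputs.
        candidate = optimize(best_prompt, best_outputs)
        # Execute: run the candidate on the same small sample of questions.
        candidate_outputs = [execute(candidate, q) for q in questions]
        # Evaluate: pairwise comparison of outputs, no ground truth involved.
        if evaluate(questions, candidate_outputs, best_outputs):
            best_prompt, best_outputs = candidate, candidate_outputs
    return best_prompt

# Toy stand-ins (assumptions for illustration only; a real setup would call LLMs).
def toy_optimize(prompt: str, outputs: List[str]) -> str:
    if "Explain" not in prompt:
        return prompt + " Explain your reasoning."
    return prompt + " Be verbose."

def toy_execute(prompt: str, question: str) -> str:
    suffix = " because ..." if "Explain" in prompt else ""
    return f"answer to {question}{suffix}"

def toy_evaluate(questions: List[str], new: List[str], old: List[str]) -> bool:
    # The candidate wins only if strictly more of its outputs give a justification.
    return sum("because" in o for o in new) > sum("because" in o for o in old)

result = spo_loop("Answer.", ["q1", "q2", "q3"],
                  toy_optimize, toy_execute, toy_evaluate, iterations=3)
print(result)
```

In this toy run the first candidate is accepted because its outputs win the comparison, while later candidates only tie and are rejected, so the incumbent prompt is kept. The three-question sample mirrors the small sample sizes mentioned in the abstract.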

Paper Structure

This paper contains 42 sections, 3 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of Prompt Optimization Methods. (a) illustrates the traditional prompt optimization process with external reference, where feedback from ground truth or humans is used to iteratively improve the best prompt. (b) presents our proposed self-supervised prompt optimization, which utilizes pairwise comparisons of the LLM's own outputs to optimize prompts without relying on external reference.
  • Figure 2: Comparison of Performance ($y$-axis) and Optimization Costs in Dollars ($x$-axis) across Six Prompt Optimization Methods. SPO demonstrates competitive performance, consistently ranking among the top two methods while maintaining significantly lower costs (ranging from 1.1% to 5.6% of the costs incurred by other methods) across all datasets.
  • Figure 3: Components of the Evaluation Framework for Prompt Optimization. (a) Evaluation Sources: Compares different outputs, including ground truth and model-generated outputs, to assess quality. (b) Evaluation Methods: Showcases various evaluation techniques, including benchmark comparisons, LLM-as-a-Judge, and human feedback. (c) Feedback Types: Showcases a range of feedback. For clarity, the rank-signal example now compares only Output A and Output B. The blue highlights in (a), (b), and (c) indicate the specific evaluation approach selected for SPO.
  • Figure 4: A Running Example of the SPO Framework: Pairwise evaluation of the outputs selects the better one from the corresponding prompts. The best output and prompt pair are highlighted with pentagrams and are updated after evaluation. Furthermore, using a case from MT-Bench, we show the complete process of SPO's $\phi_{opt}$, $\phi_{exe}$, and $\phi_{eval}$ and the corresponding prompts.
  • Figure 5: Win rates comparison between different LLMs and SPO across three tasks. The heatmap shows pairwise win rates (%) where each cell represents the row model's win rate against the column model. Models tested include Claude-3.5-Sonnet, DeepSeek-V3, and GPT-4o-mini. Models are evaluated both in IO (top three rows) and after SPO optimization (bottom three rows). Win rates range from 0% to 100%, with higher percentages indicating better performance.
  • ...and 2 more figures
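
As one reading of the Figure 5 caption, each heatmap cell holds the percentage of pairwise comparisons in which the row model beats the column model. The sketch below shows how such a matrix could be computed from a list of judged comparisons. The function name `win_rate_matrix`, the record format, and the placeholder labels "A" and "B" are assumptions for illustration. The toy records are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

def win_rate_matrix(
    models: List[str],
    comparisons: List[Tuple[str, str, str]],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Build row-vs-column win rates (%) from (model_a, model_b, winner) records.

    Ties are not modeled in this sketch; every record must name a winner.
    """
    wins: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for a, b, winner in comparisons:
        loser = b if winner == a else a
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        wins[(winner, loser)] += 1
    matrix: Dict[str, Dict[str, Optional[float]]] = {}
    for row in models:
        matrix[row] = {}
        for col in models:
            if row == col:
                continue  # a model is not compared against itself
            n = totals[(row, col)]
            # None marks pairs that were never compared.
            matrix[row][col] = 100.0 * wins[(row, col)] / n if n else None
    return matrix

# Toy records with placeholder labels: A wins 3 of 4 comparisons against B.
toy = [("A", "B", "A"), ("A", "B", "A"), ("A", "B", "B"), ("A", "B", "A")]
rates = win_rate_matrix(["A", "B"], toy)
print(rates)
```

With these toy records, the cell for row A and column B is 75.0 and the mirrored cell is 25.0, matching the "row model's win rate against the column model" convention in the caption.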