In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement

Zhen-Yu Zhang, Jiandong Zhang, Huaxiu Yao, Gang Niu, Masashi Sugiyama

TL;DR

PAPO is a test-time refinement framework that requires no retraining: it jointly optimizes the prompt and the pseudo-supervision, using in-context demonstrations drawn from unsupervised downstream tasks. By translating gradient signals into textual critiques via TextGrad, PAPO iteratively refines both the prompt and the pseudo-labels to improve generation quality, while a clustering/multi-manifold regularization effect mitigates overfitting. The approach is validated on QA and NLI benchmarks and on a real-world molecule optimization task, where PAPO consistently outperforms baselines and shows favorable trade-offs between performance and compute. The results suggest that using the entire pseudo-supervised dataset with in-context learning, rather than relying solely on high-confidence subsets, yields more robust improvements suitable for practical deployment and downstream fine-tuning.

Abstract

Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. Most existing methods rely on human supervision or parameter retraining, both of which are costly in terms of data collection and computational resources. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks and use them for in-context prompting or prompt optimization to refine the pseudo-supervision. However, relying solely on such data may lead to overfitting. In this paper, we leverage the in-context learning (ICL) abilities of LLMs and propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision. The proposed learning objective ensures that the optimized prompt guides the LLM to generate consistent responses for a given input when pseudo-supervised data from the downstream task are used as demonstrations, enabling refinement over the entire pseudo-supervision. The prompt is optimized by translating gradient signals into textual critiques, which serve as feedback to iteratively refine the prompt and model responses. Theoretical analysis in a simplified classification setting shows that the refined pseudo-supervision exhibits a geometric clustering structure, helping to mitigate overfitting. Experiments on question answering and natural language inference benchmarks, as well as a real-world molecule optimization task, show the effectiveness of the proposed algorithm.
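
The consistency idea in the abstract can be illustrated with a toy sketch. It is not the PAPO algorithm: the LLM is replaced by a hypothetical stub (`stub_icl_predict`, a majority vote over the nearest demonstrations), the inputs are plain numbers, and the prompt-update step driven by textual critiques is omitted because it needs a real model. All names and data below are assumptions made for illustration. The sketch only shows how re-predicting every item with the rest of the pseudo-supervised data as demonstrations can pull noisy pseudo-labels toward a clustered structure.

```python
from collections import Counter


def stub_icl_predict(query, demonstrations, k=3):
    """Hypothetical stand-in for an LLM doing in-context learning.

    Returns the majority label among the k demonstrations whose inputs
    are closest to the query. A real system would call a model here.
    """
    nearest = sorted(demonstrations, key=lambda d: abs(d[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def refine_pseudo_labels(inputs, pseudo_labels, rounds=5):
    """Re-predict each item using all other items as demonstrations.

    Stops early once a full pass leaves every pseudo-label unchanged,
    i.e. the predictions are consistent with the demonstrations.
    """
    labels = list(pseudo_labels)
    for _ in range(rounds):
        new_labels = []
        for i, x in enumerate(inputs):
            demos = [(inputs[j], labels[j]) for j in range(len(inputs)) if j != i]
            new_labels.append(stub_icl_predict(x, demos))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Two well-separated groups of inputs with two flipped pseudo-labels.
inputs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
noisy = ["a", "a", "b", "a", "a", "b", "b", "a", "b", "b"]
refined = refine_pseudo_labels(inputs, noisy)
print(refined)
```

Note that the refinement uses every pseudo-labeled item as a potential demonstration rather than a filtered high-confidence subset, which is the contrast the abstract draws.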

Paper Structure (30 sections, 2 theorems, 18 equations, 5 figures, 4 tables, 1 algorithm)

Key Result

Lemma 1

For any $L \geq 1$, under the same setting as Theorem G.1 in bai2023transformers, the $(2L)$-layer transformer $TF_{\theta}$ there approximates the true gradient descent trajectory $\{\mathbf{w}^{\ell}_{\textnormal{GD}}\}_{\ell \geq 0}$. For the intermediate iterates [...], where $L_{f} = \sup_{\mathbf{w}\in\mathcal{W}}\|\nabla^2\widehat{L}_N(\mathbf{w})\|_{\textnormal{op}}$ [...]
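
To make the objects named in the lemma concrete, the following minimal sketch computes a gradient descent trajectory and the smoothness constant $L_f$. It assumes a two-dimensional, noise-free least-squares empirical risk on an unconstrained domain, chosen purely for illustration and not taken from the paper; all function names and data are hypothetical. For this loss the Hessian is the constant matrix $X^\top X / N$, so $L_f$ is simply its largest eigenvalue.

```python
import math


def empirical_risk(w, xs, ys):
    """Least-squares empirical risk: (1 / 2N) * sum of squared residuals."""
    n = len(xs)
    return sum((w[0] * x[0] + w[1] * x[1] - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradient(w, xs, ys):
    """Gradient of the empirical risk with respect to w."""
    n = len(xs)
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        residual = w[0] * x[0] + w[1] * x[1] - y
        g[0] += residual * x[0] / n
        g[1] += residual * x[1] / n
    return g


def smoothness_constant(xs):
    """Largest eigenvalue of the 2x2 Hessian X^T X / N (closed form)."""
    n = len(xs)
    a = sum(x[0] * x[0] for x in xs) / n
    b = sum(x[0] * x[1] for x in xs) / n
    d = sum(x[1] * x[1] for x in xs) / n
    return (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)


def gd_trajectory(xs, ys, eta, steps):
    """Return the iterates w^0, w^1, ..., w^steps of plain gradient descent."""
    w = [0.0, 0.0]
    trajectory = [w]
    for _ in range(steps):
        g = gradient(w, xs, ys)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
        trajectory.append(w)
    return trajectory


# Illustrative data generated from a known weight vector without noise.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
w_true = (2.0, -1.0)
ys = [w_true[0] * x[0] + w_true[1] * x[1] for x in xs]

L_f = smoothness_constant(xs)
trajectory = gd_trajectory(xs, ys, eta=1.0 / L_f, steps=200)
losses = [empirical_risk(w, xs, ys) for w in trajectory]
print(L_f, trajectory[-1], losses[-1])
```

The step size $1/L_f$ is the standard choice for which gradient descent on an $L_f$-smooth convex loss does not increase the loss from one iterate to the next.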

Figures (5)

  • Figure 1: Comparison between training-time optimization (e.g., RLHF and Self-Refine) and test-time optimization with or without human supervision (e.g., Test-time Alignment and PAPO). PAPO enables test-time refinement without retraining model parameters or requiring human supervision.
  • Figure 2: An illustration of the PAPO algorithm.
  • Figure 3: Ablation studies of the PAPO algorithm.
  • Figure 4: Vina score and QED score of the molecules refined by PAPO and Auto-CoT compared to clinically approved compounds. The molecule refined by PAPO exhibits greater structural similarity to its closest approved counterpart while achieving better QED and Vina scores.
  • Figure : Pseudo-supervised-demonstrations Aligned Prompt Optimization (PAPO)

Theorems & Definitions (2)

  • Lemma 1: Corollary G.1 in bai2023transformers
  • Theorem 1