GPS: General Per-Sample Prompter

Pawel Batorski, Paul Swoboda

TL;DR

GPS introduces a general-purpose, per-sample prompter trained with reinforcement learning to generate input-specific prompts for unseen tasks without task-specific data. It combines a trainable Prompt Generator, a frozen Evaluator Model, two regularization schemes to prevent leakage, and Minimum Bayes Risk decoding to stabilize inference, achieving competitive results across summarization, simplification, classification, and GSM8K. The work emphasizes zero-shot adaptability and cross-task generalization, showing that per-input prompts can outperform many task-specific prompting methods without large, curated datasets. This paradigm points to practical, on-demand prompting for diverse NLP tasks and motivates further refinements in regularization and sample-efficient training.

Abstract

LLMs are sensitive to prompting, with task performance often hinging on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization for effectively adapting to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS demonstrates competitive performance: we attain second-best results among baselines on text simplification, third-best results on summarization, and on-par results on classification, while not training on any of these tasks, in contrast to the baselines. For in-domain prompting, we obtain state-of-the-art results on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at https://github.com/Batorskq/GPS.
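
To make the Minimum Bayes Risk (MBR) decoding step concrete, here is a minimal sketch of the general technique: sample several candidates and keep the one that agrees most, on average, with the rest. This is not the authors' implementation. The abstract does not say which candidates GPS scores or which utility it uses, so the token-overlap F1 utility and the names `token_f1` and `mbr_select` are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def token_f1(a: str, b: str) -> float:
    """Unigram-overlap F1 between two strings (assumed stand-in utility)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        # Two empty strings agree perfectly; otherwise there is no overlap.
        return 1.0 if ta == tb else 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("mbr_select needs at least one candidate")
    best_idx, best_score = 0, float("-inf")
    for i, cand in enumerate(candidates):
        scores = [utility(cand, other)
                  for j, other in enumerate(candidates) if j != i]
        # A single candidate has nothing to compare against; score it 0.
        mean_score = sum(scores) / len(scores) if scores else 0.0
        if mean_score > best_score:  # strict '>' keeps the earliest on ties
            best_idx, best_score = i, mean_score
    return candidates[best_idx]


if __name__ == "__main__":
    samples = ["a b c", "a b c d", "a b", "x y z"]
    print(mbr_select(samples))
```

Selecting the consensus candidate rather than a single sample is what makes inference less sensitive to one unlucky generation. The cost is quadratic in the number of candidates, which is usually acceptable for a handful of samples.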

Paper Structure

This paper contains 34 sections, 3 equations, 4 figures, 5 tables, 3 algorithms.

Figures (4)

  • Figure 1: Left: Comparison of existing works to GPS. We propose the first automatic prompting method that is (i) general purpose, i.e. works without a task-specific training set and task-specific training and (ii) improves upon user-given prompts through refinement on a per-sample basis. Right: Overview of GPS, a general, per-sample prompter trained on mathematical, logical, and programming tasks. Once trained, it generates out-of-domain prompts for classification, summarization, and simplification. The model operates in a per-sample regime, producing a unique prompt for each input.
  • Figure 2: Training cycle of GPS. First, the Prompt Generator produces an initial prompt based on the given observation. This prompt is then regularized using either Judge Regularization or Sample Regularization to prevent label leakage, i.e., the inclusion of the correct answer within the prompt itself. The Evaluator then assesses the quality of the regularized prompt by measuring its accuracy and provides a reward signal. Finally, the model is updated based on this feedback to improve prompt quality over time.
  • Figure 3: Comparison of accuracy on the DeepMath benchmark across different regularization strategies and evaluator sizes.
  • Figure 5: Comparison between a flawed and a regularized prompt setup for subjectivity classification. The observation is the actual user input. The leakage prompt embeds the correct answer within an example that mirrors the test input, effectively leaking the label into the prompt. This kind of leakage compromises evaluation integrity, as it allows the model to extract or memorize the answer without performing the task. The regularized prompt, on the other hand, avoids including the target label and better reflects a fair testing setup. The ground truth shows the expected model output. Regularization techniques are essential for mitigating this type of leakage and ensuring reliable performance evaluation.
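
The training cycle in Figure 2 and the leakage example in Figure 5 can be illustrated with a small reward function. This is a hedged sketch only: the section does not specify how Judge Regularization or Sample Regularization work, so the verbatim substring check below is a naive stand-in, and giving leaking prompts zero reward is an assumption. The names `leaks_answer`, `reward`, and the `evaluator` callable are hypothetical.

```python
from typing import Callable


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def leaks_answer(prompt: str, answer: str) -> bool:
    """Naive leakage check (assumption): the answer appears verbatim in the prompt."""
    target = _normalize(answer)
    return bool(target) and target in _normalize(prompt)


def reward(
    prompt: str,
    observation: str,
    answer: str,
    evaluator: Callable[[str, str], str],
) -> float:
    """Score a generated prompt: 0.0 if it leaks, else 1.0 when the evaluator is correct."""
    if leaks_answer(prompt, answer):
        # Assumed penalty: a prompt that contains the answer earns no reward.
        return 0.0
    prediction = evaluator(prompt, observation)
    return 1.0 if _normalize(prediction) == _normalize(answer) else 0.0


if __name__ == "__main__":
    # Toy evaluator standing in for the frozen Evaluator Model.
    def toy_evaluator(prompt: str, observation: str) -> str:
        return "42"

    question = "What is 6 times 7?"
    print(reward("Solve the problem step by step.", question, "42", toy_evaluator))
    print(reward("The answer is 42, just state it.", question, "42", toy_evaluator))
```

A substring check like this would wrongly flag classification prompts that simply list the label set, so it should be read as an illustration of why leakage must be penalized during training, not as a usable detector.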