Universal Self-Adaptive Prompting

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan O. Arik, Tomas Pfister

TL;DR

USP addresses weak zero-shot performance by designing prompts automatically, without labeled data. It uses a two-stage framework: the LLM first generates candidate pseudo-demonstrations (pseudo-demos) from unlabeled queries in a zero-shot pass, and a task-specific selector then assembles a small, high-quality set of them to prepend to the final prompt. Tasks are categorized as Classification (CLS), Short-form Generation (SFG), or Long-form Generation (LFG), each with its own scoring function. With PaLM and PaLM 2 models, USP achieves substantial gains over standard zero-shot prompting and often matches or surpasses few-shot baselines across more than 40 tasks, including the BIG-Bench Hard (BBH) suite. The approach is lightweight and black-box, requiring only an inference-only LLM, which makes it practical for deploying large language models at scale.

Abstract

A hallmark of modern large language models (LLMs) is their impressive general zero-shot and few-shot abilities, often elicited through in-context learning (ICL) via prompting. However, while highly coveted and being the most general, zero-shot performances in LLMs are still typically weaker due to the lack of guidance and the difficulty of applying existing automatic prompt design methods in general tasks when ground-truth labels are unavailable. In this study, we address this by presenting Universal Self-Adaptive Prompting (USP), an automatic prompt design approach specifically tailored for zero-shot learning (while compatible with few-shot). Requiring only a small amount of unlabeled data and an inference-only LLM, USP is highly versatile: to achieve universal prompting, USP categorizes a possible NLP task into one of the three possible task types and then uses a corresponding selector to select the most suitable queries and zero-shot model-generated responses as pseudo-demonstrations, thereby generalizing ICL to the zero-shot setup in a fully automated way. We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines and often comparable to or even superior to few-shot baselines across more than 40 natural language understanding, natural language generation, and reasoning tasks.
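
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `usp_prompt`, the `generate` and `score` callables, and the `Q:`/`A:` prompt template are all hypothetical stand-ins, and the real task-specific selectors are defined in the paper itself.

```python
from typing import Callable, List, Tuple


def usp_prompt(
    unlabeled_queries: List[str],
    test_query: str,
    generate: Callable[[str], str],      # hypothetical: zero-shot LLM call
    score: Callable[[str, str], float],  # hypothetical: task-specific selector score
    num_demos: int = 3,
) -> str:
    """Build a Stage 2 prompt from model-generated pseudo-demos."""
    # Stage 1: query the model zero-shot on the unlabeled data.
    candidates: List[Tuple[str, str]] = [(q, generate(q)) for q in unlabeled_queries]

    # Selection: keep the highest-scoring (query, response) pairs.
    # Higher score is assumed to mean a more confident prediction.
    ranked = sorted(candidates, key=lambda qr: score(qr[0], qr[1]), reverse=True)
    demos = ranked[:num_demos]

    # Stage 2: prepend the pseudo-demos to the test query.
    demo_text = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in demos)
    return f"{demo_text}\n\nQ: {test_query}\nA:"


# Toy usage with stand-in functions (no real LLM involved).
toy_prompt = usp_prompt(
    unlabeled_queries=["a", "bbb", "cc"],
    test_query="test",
    generate=lambda q: q.upper(),
    score=lambda q, r: float(len(q)),
    num_demos=2,
)
```

No labels appear anywhere in the sketch: both the responses and the scores come from the model's own zero-shot outputs, which is what lets in-context learning carry over to the zero-shot setting.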

Paper Structure (40 sections, 8 equations, 9 figures, 11 tables, 1 algorithm)

Figures (9)

  • Figure 1: We propose USP, a versatile zero-shot prompting method that improves over standard zero-shot prompting across more than 40 Classification (CLS), Short-form Generation (SFG) and Long-form Generation (LFG) tasks (see the section on task-specific selectors for further explanation) with PaLM-62B, PaLM-540B and PaLM 2 models.
  • Figure 2: Overview of (a) zero-shot setup, (b) few-shot setup with in-context learning, (c) Consistency-based Self-Adaptive Prompting (COSP; Wan et al., 2023) and (d) Universal Self-Adaptive Prompting, or USP, the proposed method in this work. The queries without demos with which LLMs are directly prompted (zero-shot, or Stage 1 in COSP and USP) are marked with red arrows, and the queries prepended with either the handcrafted demos (few-shot) or model-generated pseudo-demos (Stage 2 in COSP and USP) are marked with blue arrows.
  • Figure 3: Accuracy on BIG-Bench Hard tasks with PaLM 2-M (each line represents a task of the suite; refer to the appendix on datasets and models for full details). The gain/loss of USP over standard 0-shot is shown in percentages. Note that 3 (pseudo-)demos are generated per query following Anil et al. (2023). Human refers to average human performance from Suzgun et al. (2022).
  • Figure 4: USP picks confident predictions that are more likely to be better. Ground-truth performance metrics in the Stage 1 unlabeled samples ($\mathcal{D}$) against USP scores in selected tasks with PaLM-540B: $\mathcal{F}_{\texttt{CLS}}$ against accuracy (CLS), $\mathcal{F}_{\texttt{SFG}}$ against EM (SFG), and $\mathcal{F}_{\texttt{LFG}}$ against ROUGE-LSum (LFG). CLS: single-sample accuracy is binary, so we discretize $\mathcal{F}_{\texttt{CLS}}$ into deciles and show the mean accuracy $\pm$ 1 sem in each bin. SFG: same as CLS, except that $\mathcal{F}_{\texttt{SFG}}$ is already discrete and no further discretization is performed; marker sizes are proportional to the number of samples at each $\mathcal{F}_{\texttt{SFG}}$ value. LFG: both the evaluation metric and $\mathcal{F}_{\texttt{LFG}}$ are continuous, so we plot all data without aggregation; since we query each $d^{(j)} \in \mathcal{D}$ 6 times, we show the mean $\pm$ sem ground-truth ROUGE score for each $d^{(j)}$; gray $\times$ markers denote outliers. The overall mean performance over $\mathcal{D}$ (gray dashed lines) and linear trend lines with confidence intervals are shown in all plots. More results are provided in the appendix comparing USP scores against ground truth.
  • Figure 5: Gain from USP is larger with higher zero-shot uncertainty. Relative gain of Stage 2 over Stage 1 accuracy/EM in PaLM-540B/CLS tasks (left) & PaLM 2-M/BBH tasks (right) against average USP score: $\mathbb{E}_{z \sim \mathcal{D}} [\mathcal{F}_{\texttt{CLS/SFG}}(z)]$. A higher average USP score denotes lower zero-shot uncertainty. Trend lines and confidence intervals (shades) are shown.
  • ...and 4 more figures
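
The binning described in the Figure 4 caption for CLS tasks (discretize the selector score into deciles, then report mean accuracy ± 1 standard error per bin) can be sketched as below. This is an illustrative reconstruction under assumptions, not the authors' plotting code: the function name `binned_accuracy` is hypothetical, and equal-count bins over the sorted scores are assumed as the decile rule.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def binned_accuracy(
    scores: Sequence[float],
    correct: Sequence[int],
    num_bins: int = 10,
) -> List[Tuple[float, float]]:
    """Return (mean accuracy, standard error) for each equal-count score bin."""
    # Sort sample indices by selector score, lowest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    results: List[Tuple[float, float]] = []
    for b in range(num_bins):
        # Equal-count slices of the sorted samples (assumed decile rule).
        idx = order[b * n // num_bins:(b + 1) * n // num_bins]
        if not idx:
            continue  # skip empty bins when there are fewer samples than bins
        vals = [float(correct[i]) for i in idx]
        # Standard error of the mean; undefined for a single sample, so use 0.0.
        sem = stdev(vals) / math.sqrt(len(vals)) if len(vals) > 1 else 0.0
        results.append((mean(vals), sem))
    return results


# Toy data: low-score samples are all wrong, high-score samples are all right.
toy_bins = binned_accuracy(list(range(20)), [0] * 10 + [1] * 10, num_bins=2)
```

If the selector score tracks correctness, as the caption reports, the per-bin mean accuracy should rise from the lowest-score bin to the highest.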