Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction

Junlang Qian, Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Zepeng Zhai, Kezhi Mao

TL;DR

The paper addresses prompt brittleness in zero-shot text classification by moving beyond sole reliance on next-token predictions and introducing Placeholding Parallel Prediction ($\mathcal{P}^3$). $\mathcal{P}^3$ enables multiple subsequent token predictions in a single LM run by appending placeholder tokens and aggregating predictions, reducing sensitivity to prompt wording while maintaining efficiency. Empirical results across seven datasets with LLaMA2-13B/70B demonstrate substantial reductions in cross-prompt variance and improved accuracy, with strong performance even without prompts. This approach significantly lowers the need for prompt engineering, enhances robustness, and offers a scalable, efficient pathway for robust zero-shot classification in practical deployments.

Abstract

Zero-shot text classification typically relies on prompt engineering, but the inherent prompt brittleness of large language models undermines its reliability. Minor changes in the prompt can cause significant discrepancies in model performance. We attribute this prompt brittleness largely to the narrow focus on next-token probabilities in existing methods. To address this, we propose Placeholding Parallel Prediction ($\mathcal{P}^3$), a novel approach that predicts token probabilities across multiple positions and simulates comprehensive sampling of generation paths in a single run of a language model. Experiments show improved accuracy and up to 98% reduction in the standard deviation across prompts, boosting robustness. Even without a prompt, $\mathcal{P}^3$ maintains comparable performance, reducing the need for prompt engineering.
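
The mechanism described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: `toy_lm` is a hypothetical stand-in for one forward pass of a language model, its probabilities are made up, and the aggregation rule (a plain mean over positions) is only a placeholder for whatever rule the authors actually use. The sketch shows the general idea only: append `eta` placeholder tokens, run the model once, and read label scores from the next-token position plus the placeholder positions.

```python
from typing import Callable, Dict, List, Sequence

PLACEHOLDER = "<ph>"  # placeholder symbol as drawn in Figure 4; the real token is model-specific
Distribution = Dict[str, float]


def toy_lm(tokens: Sequence[str]) -> List[Distribution]:
    """Hypothetical stand-in for ONE forward pass of a language model.

    Returns one distribution per input position; the distribution at index i
    is the prediction for the token that follows position i.
    All numbers are invented for illustration.
    """
    outputs: List[Distribution] = []
    for tok in tokens:
        if tok == PLACEHOLDER:
            outputs.append({"sports": 0.5, "politics": 0.25, "the": 0.25})
        else:
            outputs.append({"sports": 0.25, "politics": 0.25, "the": 0.5})
    return outputs


def multi_position_scores(
    text_tokens: Sequence[str],
    labels: Sequence[str],
    eta: int,
    lm: Callable[[Sequence[str]], List[Distribution]] = toy_lm,
) -> Dict[str, float]:
    """Score each label using positions 0..eta after the text, from a single run."""
    if not text_tokens:
        raise ValueError("text_tokens must not be empty")
    if eta < 0:
        raise ValueError("eta must be non-negative")
    padded = list(text_tokens) + [PLACEHOLDER] * eta
    outputs = lm(padded)  # the only model call
    first = len(text_tokens) - 1  # output here is the ordinary next-token prediction
    used = outputs[first:first + eta + 1]  # eta + 1 distributions in total
    # Assumption: aggregate by a simple mean over positions.
    return {
        label: sum(dist.get(label, 0.0) for dist in used) / len(used)
        for label in labels
    }


tokens = ["the", "match", "ended"]
print(multi_position_scores(tokens, ["sports", "politics"], eta=0))
print(multi_position_scores(tokens, ["sports", "politics"], eta=2))
```

With `eta=0` the function reduces to ordinary next-token scoring, which matches the convention in the Figure 5 caption that $\eta=0$ corresponds to next-token prediction.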

Paper Structure

This paper contains 28 sections, 31 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: An example of prompt brittleness. The prompt "It is a _" yields a notably high score for "tree", while "It is an _" overwhelmingly favors "insect". The percentage scores are normalized for an arbitrary text, $\mathcal{T}$ = "knows grammar.", which is unrelated to any class.
  • Figure 2: Accuracies of plausible prompts using tokens at the first three positions. Each row corresponds to a position: the $\bullet$ green dots represent the next token (the 0th position), and the $\bullet$ blue and $\bullet$ red dots represent the 1st and 2nd tokens. Each dot denotes a prompt, and its horizontal coordinate indicates its performance.
  • Figure 3: Next-Token Prediction versus Placeholding Parallel Prediction. Our proposed $\mathcal{P}^\mathit{3}$ obtains multiple token predictions in a single language model run.
  • Figure 4: (a) Next-Token Prediction. (b) Placeholding Skipping Prediction ($\mathcal{PSP}$). (c) Placeholding Parallel Prediction ($\mathcal{P}^\mathit{3}$). The small green rectangles indicate the output tokens to be used, and the grey ones indicate those not to be used. <ph> represents a placeholder token.
  • Figure 5: Average accuracy and average cross-prompt standard deviation (i.e., prompt brittleness) across seven datasets of $\mathcal{P}^\mathit{3}$. The horizontal axis represents the hyperparameter $\eta$, where $\eta=0$ corresponds to the next-token prediction results, and larger $\eta$ values indicate consideration of more distant token positions. $\eta>0$ shows higher accuracy and lower standard deviation compared to next-token prediction.
  • ...and 3 more figures
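
The Figure 5 caption treats prompt brittleness as the cross-prompt standard deviation of accuracy. A minimal sketch of that measurement follows. The function names are hypothetical, the accuracy values are invented, and the choice of population standard deviation (`statistics.pstdev`) is an assumption, since this summary does not say which estimator the paper uses.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def cross_prompt_stats(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean accuracy, standard deviation) over one accuracy per prompt."""
    if not accuracies:
        raise ValueError("need at least one prompt accuracy")
    # Assumption: population standard deviation; a sample estimator would use stdev().
    return mean(accuracies), pstdev(accuracies)


def relative_std_reduction(baseline_std: float, new_std: float) -> float:
    """Fraction by which the cross-prompt standard deviation shrinks."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return 1.0 - new_std / baseline_std


# Invented accuracies for three prompts under two scoring schemes.
next_token_acc = [0.50, 0.70, 0.60]
multi_token_acc = [0.68, 0.70, 0.69]

_, base_std = cross_prompt_stats(next_token_acc)
_, new_std = cross_prompt_stats(multi_token_acc)
print(f"relative reduction in std: {relative_std_reduction(base_std, new_std):.2%}")
```

A reduction close to 1.0 means accuracy barely changes from prompt to prompt, which is the sense in which the abstract reports "up to 98% reduction in the standard deviation across prompts".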