Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Tim Genewein, Li Kevin Wenliang, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter

TL;DR

This work frames prompting as Bayesian conditioning of a memory-based meta-learned predictor, arguing that optimal prompting exists when the target task lies within the pretraining meta-distribution and that meta-trained networks implement Bayes-optimal in-context adaptation via their activations. It formalizes prefix-tuning and characterizes conditions under which prompt-based adaptation can reach Bayes-optimal performance, highlighting two failure modes: multimodal target distributions and genuinely novel atomic tasks. Through educational experiments on LSTMs and Transformers using coin-flip data, it shows that soft prefixes can nearly achieve Bayes-optimal predictions for single-task targets and can even influence untrained networks, whereas prefix prompting struggles with task mixtures and weight-tuning can overcome these limits. The results provide a principled understanding of in-context learning, quantify fundamental prompting limits, and offer guidance on when to favor weight-tuning or soft prompts in practice.

Abstract

Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
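As a minimal sketch of the Bayesian view, assume the coin-flip setting used in the experiments and an idealized predictor that is exactly Bayes-optimal for coins with uniform random bias (the Laplace rule). Conditioning this predictor on a hard prefix of coin flips shifts its predictions toward a target coin. The function names and the search over prefix head counts are illustrative assumptions, not the paper's implementation, and a trained network only approximates this predictor.

```python
from fractions import Fraction

def laplace_predict(heads, tails):
    """P(next flip = heads) under a uniform prior on the coin bias (Laplace rule)."""
    return Fraction(heads + 1, heads + tails + 2)

def predict_with_prefix(prefix, observed):
    """Condition the predictor on a hard prefix (list of 0/1 flips) plus observed flips."""
    seq = list(prefix) + list(observed)
    heads = sum(seq)
    return laplace_predict(heads, len(seq) - heads)

def best_hard_prefix_heads(length, target):
    """Number of heads in a prefix of the given length whose first
    prediction is closest to the target bias (hypothetical helper)."""
    return min(
        range(length + 1),
        key=lambda h: abs(laplace_predict(h, length - h) - target),
    )

target = Fraction(1, 5)  # single coin with bias 0.2
h = best_hard_prefix_heads(6, target)
print(h, float(laplace_predict(h, 6 - h)))  # prints: 1 0.25
```

Under these assumptions, the best length-6 hard prefix yields a first prediction of 0.25 rather than 0.2, because a hard prefix can only move the predictor between a discrete set of posterior states. This is one intuition for why real-valued soft prefixes might do better at the same length.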

Paper Structure

This paper contains 43 sections, 10 equations, 12 figures.

Figures (12)

  • Figure 1: Pretraining on sequences from coins with uniform random bias (length ${N_\text{train}=100}$), then fine-tuning to the target task of a single coin with bias $0.2$ (tuning sequence length ${N_\text{tune}=50}$). Plots show prediction performance on the target task for different prefix- and weight-tuning methods. For both Transformers and LSTMs, Soft Prompting ('SoftPT') leads to optimal performance, showing that networks can be successfully prompted to behave Bayes-optimally on the target distribution ('TargetBayes'). This holds up to the tuning sequence length ($50$), with only minor degradations up to $200$ steps. The corresponding soft prefixes of length $6$ outperform even the best hard token prefixes of the same length ('HardPF'). Most weight-tuning methods also perform very well. See \ref{sec:experiments} for method details. Thick lines and bars show the median over $10$ tuning repetitions, thin lines show individual repetitions, and shaded areas/bars show the $25\%$ and $75\%$ quantiles. See \ref{fig:main_result_R2S_internal} for a visualization of the models' internal dynamics. Regret curves for the LSTM, similar to the top-left panel, are shown in \ref{fig:regret_R2S_app}.
  • Figure 2: 2D PCA projection of Transformer's (top) and LSTM's (bottom) internal state (= activations), illustrating how differently tuned prefixes affect state and subsequent dynamics. \ref{fig:internal_states_R2M_app} shows that the vertical principal component corresponds to the step $n$, and the horizontal to the heads-to-tails ratio. Colored lines are sequences from the target distribution (single coin with bias $0.2$), gray lines are from the pretraining distribution (uniform random). The off-distribution nature of soft prefixes is particularly visible for the Real- and Soft-prefix for the Transformer. See \ref{fig:main_result_R2S} for regret curves.
  • Figure 3: Models pretrained on sequences from coins with uniform random bias (length ${N_\text{train}=100}$) are fine-tuned to the target task of a mixture of two coins (tuning sequence length ${N_\text{tune}=50}$). No prefix-tuning method (with prefixes of length $6$) can achieve optimal performance on the target task ('TargetBayes' is optimal). Full weight-tuning, LoRA (on the Transformer), and two of the embedding-tuning variants on the LSTM do reach optimality (even beyond the tuning length of $50$ steps). See \ref{fig:internal_states_R2M_app} for a visualization of how different prefixes affect the models' internal dynamics. Regret curves for the LSTM, similar to the top-left panel, are shown in \ref{fig:regret_R2M_app}.
  • Figure 4: Untrained Transformer tuned to the Two-Coin mixture (left) and to Random Coins (right); tuning sequence length ${N_\text{tune}=50}$. In both cases, Soft Prompting is the only effective prefix-tuning method. It nearly reaches Bayes-optimality ('TargetBayes', which is a Laplace predictor on Random Coins). Performance degrades rapidly after the tuning sequence length. Full regret curves (and LSTM results) in \ref{fig:regret_U2M_app} and \ref{fig:regret_U2R_app}. Among the weight-tuning methods, LoRA is very effective.
  • Figure A5: Models pretrained on Random Coins are tuned to a Single Coin. Transformer shown in the left column, LSTM shown in the right column. Of the prefix-tuning methods, only Soft Prompting ('SoftPT') allows optimal target task performance. Several of the weight-tuning methods succeed.
  • ...and 7 more figures
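
The two-coin mixture result (Figure 3) can be illustrated with a small sketch, assuming an idealized Laplace predictor in place of the pretrained network and hypothetical mixture parameters (biases $0.2$ and $0.8$ with equal weights; the actual values are not given in this summary). The Bayes-optimal predictor for the mixture keeps a posterior over which coin generated the sequence, whereas a Laplace predictor conditioned on any hard prefix only ever holds head and tail counts. The check below covers hard prefixes of length 6 and the first prediction steps only.

```python
from fractions import Fraction

# Hypothetical mixture parameters, chosen for illustration only.
BIASES = (Fraction(1, 5), Fraction(4, 5))
WEIGHTS = (Fraction(1, 2), Fraction(1, 2))

def mixture_predict(heads, tails):
    """Bayes-optimal P(next = heads) for a mixture of two coins with known biases."""
    joint = [w * b**heads * (1 - b)**tails for w, b in zip(WEIGHTS, BIASES)]
    return sum(j * b for j, b in zip(joint, BIASES)) / sum(joint)

def laplace_predict(heads, tails):
    """Bayes-optimal P(next = heads) under a uniform prior on the coin bias."""
    return Fraction(heads + 1, heads + tails + 2)

def worst_gap(prefix_heads, prefix_len=6):
    """Largest prediction gap over the contexts: empty, one head, one tail."""
    contexts = [(0, 0), (1, 0), (0, 1)]
    return max(
        abs(
            laplace_predict(prefix_heads + h, prefix_len - prefix_heads + t)
            - mixture_predict(h, t)
        )
        for h, t in contexts
    )

# Best achievable worst-case gap over all hard prefixes of length 6.
best_gap = min(worst_gap(h0) for h0 in range(7))
print(best_gap, float(best_gap))
```

In this sketch every length-6 hard prefix leaves a nonzero gap to the mixture predictor: a balanced prefix matches the initial prediction of $0.5$ but then reacts too slowly to the first observation. This mirrors, in a simplified form, the limitation that prefixes struggle with multimodal target distributions.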