
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations

Aleksandar Petrov, Philip H. S. Torr, Adel Bibi

TL;DR

This work develops a theoretical framework for context-based fine-tuning methods in transformers, clarifying the expressiveness of prompting, soft prompting, and prefix-tuning relative to full fine-tuning. It proves that while soft prompting and prefix-tuning can exploit embedding-space capacity to control model behavior, prefix-tuning cannot alter the intrinsic attention patterns and thus cannot learn completely new tasks, only bias outputs and surface pretrained skills. The authors provide explicit constructions showing exponential generation capacity with a single virtual token and analyze cross-layer effects, explaining why prefix-tuning often fares well on related tasks but struggles with novel ones. The findings have implications for catastrophic forgetting, model alignment, and interpretability, suggesting context-based methods preserve pretrained capabilities while offering limited new capabilities, with practical success arising from skill elicitation and task-combination rather than universal task-learning.

Abstract

Context-based fine-tuning methods, including prompting, in-context learning, soft prompting (also known as prompt tuning), and prefix-tuning, have gained popularity due to their ability to often match the performance of full fine-tuning with a fraction of the parameters. Despite their empirical successes, there is little theoretical understanding of how these techniques influence the internal computation of the model and their expressiveness limitations. We show that despite the continuous embedding space being more expressive than the discrete token space, soft-prompting and prefix-tuning are potentially less expressive than full fine-tuning, even with the same number of learnable parameters. Concretely, context-based fine-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction. This suggests that while techniques like prompting, in-context learning, soft prompting, and prefix-tuning can effectively elicit skills present in the pretrained model, they may not be able to learn novel tasks that require new attention patterns.
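
The claim that a prefix can only rescale the attention over the content and bias the layer output follows directly from the softmax. The sketch below uses notation assumed for illustration, not taken from the text above: a single prefix vector $\bm s_1$ at position 0, content positions $1,\dots,n$, pre-softmax scores $T_{ij}$, pretrained attention $A_{ij}$, value matrix $\bm W_V$, and pretrained attention output $\bm t_i$.

$$
A_{ij} = \frac{\exp(T_{ij})}{\sum_{k=1}^{n} \exp(T_{ik})},
\qquad
A^\text{pt}_{ij} = \frac{\exp(T_{ij})}{\exp(T_{i0}) + \sum_{k=1}^{n} \exp(T_{ik})} = A_{ij}\,\bigl(1 - A^\text{pt}_{i0}\bigr)
\quad \text{for } j \geq 1,
$$

$$
\bm t_i^\text{pt} = A^\text{pt}_{i0}\, \bm W_V \bm s_1 + \sum_{j=1}^{n} A^\text{pt}_{ij}\, \bm W_V \bm x_j
= A^\text{pt}_{i0}\, \bm W_V \bm s_1 + \bigl(1 - A^\text{pt}_{i0}\bigr)\, \bm t_i .
$$

Under these assumptions the prefix leaves the content scores $T_{ij}$ untouched, so the ratios $A^\text{pt}_{ij} / A^\text{pt}_{ik}$ between content positions equal $A_{ij} / A_{ik}$. The only new term in the output is a multiple of the fixed vector $\bm W_V \bm s_1$, which is the "bias in a fixed direction" described in the abstract.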

Paper Structure (39 sections, 2 theorems, 32 equations, 10 figures, 4 tables)

This paper contains 39 sections, 2 theorems, 32 equations, 10 figures, 4 tables.

Key Result

Theorem 1

For any $V,N{>}0$, there exists a transformer with vocabulary size $V$, context size $N$, embedding size $d_e\texttt{=}N$, one attention layer with two heads and a three-layer MLP such that it generates any token sequence $(\texttt{Y}_1,...,\texttt{Y}_N) {\in} \{1,...,V\}^N$ when conditioned on the

Figures (10)

  • Figure 1: Attention patterns of a small transformer pretrained on sorting in ascending order. The model is given the prefix $S$ and user input $X$ and generates $Y$ autoregressively. The attention shown is the one used when the first response token $\texttt{Y}_1$ is generated. After full fine-tuning, the model sorts in descending order; prefix-tuning fails to learn this because it cannot update the learned attention. Note how the relative attention of $X$ to $X$ in the left and right plots is exactly the same: the prefix cannot change the attention pattern for the same inputs. The relative attention of $X$ to $X$ in the center plot is very different because full fine-tuning can arbitrarily change $\bm W_Q$ and $\bm W_K$.
  • Figure 2: Model pretrained on the four tasks. The four attention heads specialize in the skills necessary to solve these tasks: look at the elements in order, look first at the smallest elements or first at the largest elements.
  • Figure 3: Attention block activations for ten sequences at the last input position (10) when pretrained on the four tasks. The left plot shows the pretrained activations $\bm t_{10}$ are not predictive of the completion. The right plot shows prefixes cluster the activations $\bm t_{10}^\text{pt}$. Connecting the pretrained and prefixed activations highlights the bias. No dimensionality reduction is used; the clustering is solely due to the prefixes.
  • Figure 4: Illustration of the predictors for each token in the $\mathcal{L}_\text{proj}$ linear layer for $V=10$. The layer is constructed in such a way that the $i$-th token has the highest confidence when the input is $i-1/V$.
  • Figure 5: The attention of the twelfth head of the first layer of LLaMA (Touvron et al., 2023). The left plot shows the attention with a prefix of length one. The center plot shows the same attention, normalized so that the attention over the non-prefix positions sums to 1. The right plot shows the attention of the pretrained model (without a prefix). The center and right plots are the same, illustrating that the prefix only scales down the attention over the content (non-prefix positions) and does not change its relative distribution, which empirically validates \ref{eq:pt_attention_rescaling}. The test sequence is "TABLE: Fourth Round Qualifying : NEW_ENTRIES_THIS_ROUND : 24 TEXT: Fourth round qualifying had 24 new entries." from the DART table-to-text dataset (Nan et al., 2021).
  • ...and 5 more figures
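
The renormalization check described in the Figure 5 caption can be reproduced on a toy attention head. The following is a minimal sketch, not the authors' code: the sizes, the random weights, the helper names `softmax` and `attention`, and the omission of a causal mask are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 8, 5                    # hypothetical embedding size and content length
X = rng.normal(size=(n, d))    # content embeddings
s = rng.normal(size=(1, d))    # one prefix vector, standing in for a learned prefix
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def attention(keys_values, queries):
    # Single-head attention without a causal mask (a simplification).
    scores = (queries @ W_Q) @ (keys_values @ W_K).T / np.sqrt(d)
    A = softmax(scores)
    return A, A @ (keys_values @ W_V)

# Pretrained head: content attends to content only.
A_plain, t_plain = attention(X, X)
# Prefixed head: the same queries also see the prefix as key/value position 0.
A_pref, t_pref = attention(np.vstack([s, X]), X)

a0 = A_pref[:, :1]             # attention mass each query puts on the prefix
content = A_pref[:, 1:]        # attention over the content positions
renorm = content / content.sum(axis=-1, keepdims=True)

# Renormalized content attention should match the pretrained attention.
print(np.allclose(renorm, A_plain))
# The output should be the pretrained output, scaled and shifted along W_V s.
print(np.allclose(t_pref, a0 * (s @ W_V) + (1 - a0) * t_plain))
```

In this toy setting the prefix vector `s` can be replaced by any other vector and both checks still hold, since `s` enters only through the extra score column and the extra value row.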

Theorems & Definitions (2)

  • Theorem 1: Exponential unconditional generation capacity of a single virtual token
  • Theorem 2: Conditional generation capacity for a single virtual token ($n_X\texttt{=}n_Y\texttt{=}1$)