Prompting a Pretrained Transformer Can Be a Universal Approximator

Aleksandar Petrov; Philip H. S. Torr; Adel Bibi

TL;DR

This work investigates whether prefix-tuning or prompting a pretrained transformer can universally approximate continuous sequence-to-sequence functions. It proves that a single attention head, with a carefully constructed prefix, can approximate any continuous function on the hypersphere $S^m$ to arbitrary precision, and provides Jackson-type bounds on the required prefix length. Extending to sequences, the paper shows that general seq-to-seq mappings can be realized with depth linear in the sequence length, using a Kolmogorov–Arnold-inspired construction that concatenates univariate mappings. The results illuminate the expressive power of attention heads under prefix control, offer a framework for understanding prompting and safety implications, and highlight potential efficiency trade-offs relative to full model training.

Abstract

Despite the widespread adoption of prompting, prompt tuning and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of a pretrained model by prompting or prefix-tuning it. Formally, the question is whether prompting and prefix-tuning a pretrained model can universally approximate sequence-to-sequence functions. This paper answers in the affirmative and demonstrates that much smaller pretrained models than previously thought can be universal approximators when prefixed. In fact, the attention mechanism is uniquely suited for universal approximation, with prefix-tuning a single attention head being sufficient to approximate any continuous function. Moreover, any sequence-to-sequence function can be approximated by prefixing a transformer with depth linear in the sequence length. Beyond these density-type results, we also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
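To make the notion of "prefixing" an attention head concrete, here is a minimal sketch, not the paper's construction. It assumes a frozen softmax head with identity query, key and value maps, and a hypothetical prefix given directly as key and value vectors. The function names `prefixed_head` and `attention_output` are illustrative only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_output(query, keys, values):
    # Classical attention for one query: softmax-weighted average of the values.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def prefixed_head(x, prefix_keys, prefix_values):
    # Assumption: the head's weights are frozen identity maps, so the input x
    # serves as its own query, key and value. Only the prefix is "tuned".
    keys = prefix_keys + [x]
    values = prefix_values + [x]
    return attention_output(x, keys, values)

x = [1.0, 0.0]
no_prefix = prefixed_head(x, [], [])                       # x attends only to itself
with_prefix = prefixed_head(x, [[5.0, 0.0]], [[0.0, 1.0]])  # one prefix position
print(no_prefix, with_prefix)
```

Without a prefix the head returns its input unchanged. A single prefix position whose key aligns with `x` redirects most of the attention weight to the prefix value, changing the output while the head's weights stay fixed.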

Paper Structure (18 sections, 25 theorems, 128 equations, 5 figures)

Introduction
Background Material
Transformer Architecture
Universal Approximation
Universal Approximation with a Single Attention Head
Universal Approximation of Sequence-to-Sequence Functions
Element-wise functions
General sequence-to-sequence functions
Discussion and Conclusions
Comparison with prior work
Connection to prompting and safety implications
Prefix-tuning and prompting a pretrained transformer might be less efficient than training it
Prefix-tuning and prompting may work by combining prefix-based element-wise maps with pretrained cross-element mixing
Limitations
Background on Analysis on the Sphere
...and 3 more sections

Key Result

Lemma 2.1

If $\mathcal{A}$ is dense in $\mathcal{B}$ and $\mathcal{B}$ is dense in $\mathcal{C}$, then $\mathcal{A}$ is dense in $\mathcal{C}$.
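The lemma follows from the triangle inequality. The sketch below assumes density is measured in some metric $d$ (the paper's Definition 2.1 fixes the precise notion, which is not reproduced here); the symbols $a$, $b$, $c$ and $\epsilon$ are introduced only for this illustration.

```latex
% Assumption: "A dense in B" means that for every b in B and every eps > 0
% there exists a in A with d(a, b) <= eps.
Let $c \in \mathcal{C}$ and $\epsilon > 0$.
Since $\mathcal{B}$ is dense in $\mathcal{C}$, there is $b \in \mathcal{B}$ with $d(b, c) \le \epsilon/2$.
Since $\mathcal{A}$ is dense in $\mathcal{B}$, there is $a \in \mathcal{A}$ with $d(a, b) \le \epsilon/2$.
Hence
\[
  d(a, c) \;\le\; d(a, b) + d(b, c) \;\le\; \frac{\epsilon}{2} + \frac{\epsilon}{2} \;=\; \epsilon ,
\]
so $\mathcal{A}$ is dense in $\mathcal{C}$.
```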

Figures (5)

Figure 1: Approximating functions on the hypersphere with a single attention head. A. We simplify the classical attention head into a core attention head. B. The $\exp(\lambda \langle \bm x, \bm p_k^\alpha \rangle)\bm p_k^\beta$ terms act like kernels when $\bm x$ is restricted to a hypersphere. We can approximate a function $f$ by placing $N$ control points $\bm p_1^\alpha,\ldots,\bm p_N^\alpha$ and centering a kernel at each of them. C. Increasing $\lambda$ results in less smoothing, while increasing $N$ results in more control points and hence better approximation. With large enough $\lambda$ and $N$, we can approximate $f$ to any desired accuracy. D. With the normalization term in classical attention close to a constant, and giving $\bm x$, $\bm p_k^\alpha$ and $\bm p_k^\beta$ orthogonal subspaces, core attention can be represented as classical attention. Hence, a classical attention head can also approximate $f$ with arbitrary precision.
Figure 2: The dot product is a measure of closeness over the hypersphere. We want a large dot product for points that are close together. That is not the case for general $\bm p_1^\alpha, \bm p_2^\alpha\in\mathbb{R}^{m+1}$: above we show a larger dot product for the point that is further away, i.e., $\langle \bm x, \bm p_1^\alpha \rangle < \langle \bm x, \bm p_2^\alpha \rangle$ despite $\|\bm x - \bm p_1^\alpha\|_2 < \|\bm x - \bm p_2^\alpha\|_2$. However, if we restrict $\bm x$, $\bm p_i^\alpha$, and $\bm p_j^\alpha$ to the hypersphere $S^m$, then the dot product measures the cosine of the angle between $\bm x$ and $\bm p_i^\alpha$, which is truly a measure of closeness: $\langle \bm x, \bm p_i^\alpha \rangle < \langle \bm x, \bm p_j^\alpha \rangle \iff \|\bm x - \bm p_i^\alpha\|_2 > \|\bm x - \bm p_j^\alpha\|_2$.
Figure 3: Plots of the von Mises-Fisher kernel $K^\text{vMF}_\lambda(\langle {\bm x,\bm y} \rangle)$ for $\lambda=1,5,10$ and fixed $\bm y$ in three dimensions ($m=2$). The larger $\lambda$ is, the more concentrated the kernel is around $\bm y$.
Figure 4: Intuition behind the proof of our Jackson-type bound for universal approximation on the hypersphere. A. We want to approximate a function $f$ over the hypersphere $S^m$. This illustration is in three-dimensional space, so $m=2$. B. In order to get the $\exp(\lambda \langle \cdot, \bm y \rangle)$ form that we want, we convolve $f$ with the $K^\text{vMF}_\lambda(t) = c_{m+1}(\lambda) \exp(\lambda t)$ kernel. C. We partition $S^m$ into $N$ cells $V_1,\ldots,V_N$. D. Our choice of $N$ is such that $f$ does not vary too much in each cell and hence can be approximated by a function that is constant in each $V_k$. E. As each cell is small, the dot product of $\bm x$ with any point in the cell $V_k$ can be approximated by the dot product of $\bm x$ with a fixed point $\bm b_k\in V_k$. F. This allows us to approximate the integral in the convolution with $K^\text{vMF}_\lambda$ by a finite sum.
Figure 5: The coefficients $a_k^m$ for the von Mises-Fisher kernels $K^\text{vMF}_\lambda$ for $m=2$ and $k\in\{1,2,3\}$ as well as the lower bound from the lemma bounding these coefficients (source label: lemma:bounds_on_coefficients).
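The kernel picture from Figures 1 and 4 can be tried numerically in the simplest case, the circle $S^1$ ($m=1$). The sketch below is an illustration under stated assumptions rather than the paper's construction. The target function, the uniform placement of control points, and the constant normalizer `z` folded into the value vectors are all hypothetical choices made for this example.

```python
import math

def target(theta):
    # Hypothetical smooth target on S^1, parameterised by the angle theta.
    return math.sin(2.0 * theta) + 0.5 * math.cos(theta)

def build_prefix(f, n_points, lam):
    # Control points p_k^alpha: N unit vectors at uniformly spaced angles.
    angles = [2.0 * math.pi * k / n_points for k in range(n_points)]
    # Assumption: a single constant normaliser, computed at x = angle 0,
    # is folded into the values p_k^beta (no input-dependent softmax).
    z = sum(math.exp(lam * math.cos(a)) for a in angles)
    values = [f(a) / z for a in angles]
    return angles, values

def core_attention(x_angle, angles, values, lam):
    # For unit vectors, <x, p_k> = cos(angle difference) = 1 - ||x - p_k||^2 / 2,
    # so exp(lam * <x, p_k>) is a bump centred at the control point p_k.
    return sum(math.exp(lam * math.cos(x_angle - a)) * v
               for a, v in zip(angles, values))

def max_error(n_points, lam, n_test=500):
    angles, values = build_prefix(target, n_points, lam)
    test_angles = [2.0 * math.pi * (j + 0.5) / n_test for j in range(n_test)]
    return max(abs(core_attention(t, angles, values, lam) - target(t))
               for t in test_angles)

coarse = max_error(n_points=20, lam=5.0)     # few control points, wide kernels
fine = max_error(n_points=400, lam=200.0)    # many control points, narrow kernels
print(coarse, fine)
```

Raising $\lambda$ and $N$ together shrinks the maximum error in this toy setting, in line with panel C of Figure 1. Raising $\lambda$ alone without enough control points would leave gaps between the kernels.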

Theorems & Definitions (51)

Definition 2.1: Universal Approximation (Density-Type)
Lemma 2.1: Transitivity
Definition 2.2: Approximation Rate (Jackson-Type)
Lemma 2.2
Definition 2.3: Prefixed Attention Heads Class
Definition 2.4: Prefixed Transformers Class
Definition 2.5: Scalar Functions on the Hypersphere
Definition 2.6: Vector-valued Functions on the Hypersphere
Definition 2.7: General Sequence-to-sequence Functions
Definition 2.8: Element-wise functions
...and 41 more
