Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding

Qian Ma, Ruoxiang Xu, Yongqiang Cai

TL;DR

This work analyzes vocabulary in-context learning (VICL), in which the context tokens given to a Transformer are drawn from a finite vocabulary, and asks when this setting has the universal approximation property (UAP). It shows that a single-layer Transformer without positional encoding (PE) cannot approximate arbitrary functions when the context comes from a finite vocabulary, whereas including PE enables the UAP under density conditions on the combined input set. The authors build a bridge between one-hidden-layer feedforward neural networks (FNNs) and Transformers, combining the UAP of FNNs with a Kronecker-approximation-based construction to prove the UAP of VICL with PE, and they also derive non-UAP results for VICL without PE. The findings highlight the pivotal role of PE in letting in-context learning act as a universal approximator in discrete-input settings, and they offer theoretical guidance for designing PE schemes in practical VICL applications.
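The Kronecker-approximation idea mentioned above rests on a classical density fact: for an irrational number $\alpha$, the fractional parts of $\alpha, 2\alpha, 3\alpha, \dots$ are dense in $[0, 1)$. The sketch below illustrates only this one-dimensional fact, not the paper's construction, which is not reproduced on this page. The choice $\alpha = \sqrt{2}$ and the helper name `max_circular_gap` are assumptions made for illustration.

```python
import math

def max_circular_gap(alpha, count):
    """Largest gap between neighbouring points of {n * alpha mod 1 : n = 1..count},
    measured on the unit circle (so the gap across 1 -> 0 is included)."""
    points = sorted((n * alpha) % 1.0 for n in range(1, count + 1))
    gaps = [b - a for a, b in zip(points, points[1:])]
    gaps.append(1.0 - points[-1] + points[0])  # wrap-around gap
    return max(gaps)

# Illustrative irrational; any irrational alpha gives a dense orbit.
alpha = math.sqrt(2.0)

gap_10 = max_circular_gap(alpha, 10)
gap_1000 = max_circular_gap(alpha, 1000)

# The largest hole in the orbit shrinks as more multiples are added,
# which is what density means in practice.
print(f"max gap with   10 points: {gap_10:.4f}")
print(f"max gap with 1000 points: {gap_1000:.4f}")
```

A shrinking maximum gap means that every target value in $[0, 1)$ is eventually approximated to any tolerance by some multiple of $\alpha$, which is the sense in which a countable, structured set can be dense.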

Abstract

Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, i.e., vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of ICL.

Paper Structure

This paper contains 45 sections, 19 theorems, 96 equations, 1 figure, 1 table.

Key Result

Lemma 2

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be a non-polynomial, locally bounded, piecewise continuous activation function. For any continuous function $f: \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$ defined on a compact domain $\mathcal{K}$, and for any $\varepsilon > 0$, there exist $k \in$ …
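As a rough illustration of the kind of statement Lemma 2 makes, the sketch below builds a one-hidden-layer ReLU network of width $k$ by hand and checks that its error on a compact interval shrinks as $k$ grows. The lemma's statement is cut off above, so this does not reproduce its exact form. The ReLU activation, the scalar case $d_x = d_y = 1$, the target $\sin$ on $[0, \pi]$, and the helper names `build_relu_net`, `evaluate`, and `max_error` are all assumptions chosen for illustration.

```python
import math

def relu(z):
    return max(z, 0.0)

def build_relu_net(f, lo, hi, k):
    """Width-k one-hidden-layer ReLU network that linearly interpolates f
    at k + 1 equally spaced points on [lo, hi].
    Returns (output_bias, knots, output_weights)."""
    h = (hi - lo) / k
    knots = [lo + i * h for i in range(k)]            # hidden unit i computes relu(x - knots[i])
    values = [f(lo + i * h) for i in range(k + 1)]
    slopes = [(values[i + 1] - values[i]) / h for i in range(k)]
    # Each output weight is the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, k)]
    return values[0], knots, weights

def evaluate(net, x):
    bias, knots, weights = net
    return bias + sum(w * relu(x - t) for w, t in zip(weights, knots))

def max_error(f, net, lo, hi, samples=2000):
    """Largest absolute error on a uniform sample grid over [lo, hi]."""
    grid = (lo + (hi - lo) * j / samples for j in range(samples + 1))
    return max(abs(f(x) - evaluate(net, x)) for x in grid)

lo, hi = 0.0, math.pi
err_8 = max_error(math.sin, build_relu_net(math.sin, lo, hi, 8), lo, hi)
err_32 = max_error(math.sin, build_relu_net(math.sin, lo, hi, 32), lo, hi)

print(f"width  8: max error {err_8:.5f}")
print(f"width 32: max error {err_32:.5f}")
```

The weights here are written down directly rather than trained, which keeps the example deterministic. Any tolerance $\varepsilon$ can be met by taking the width large enough, since the interpolation error for a twice-differentiable target scales with the square of the knot spacing.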

Figures (1)

  • Figure 1: An illustration of non-approximability. The black curve represents the target function, which has $N + 1$ zeros. The red curve represents a sum of exponentials, which has no more than $N$ zeros. If the UAP held, the red curve would have to pass near the $N + 2$ marked extrema in the figure. By the Intermediate Value Theorem, the function represented by the red curve would then have at least $N + 1$ zeros, contradicting the bound of at most $N$.
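The zero-counting bound behind Figure 1 can be checked numerically on a small case. The sketch below assumes the classical setting in which a sum of $N + 1$ exponentials with distinct real rates has at most $N$ real zeros; the number of terms in the paper's sum is not stated on this page, so that pairing is an assumption. The coefficients are chosen so that, with $t = e^x$, the sum equals $(t - 1)(t - 2)(t - 3)$, and the helper names `exp_sum` and `count_sign_changes` are hypothetical.

```python
import math

def exp_sum(coeffs, rates, x):
    """Evaluate sum_i coeffs[i] * exp(rates[i] * x)."""
    return sum(a * math.exp(b * x) for a, b in zip(coeffs, rates))

def count_sign_changes(f, lo, hi, steps):
    """Count sign changes of f on a uniform grid over [lo, hi].
    Grid points where f is exactly zero are skipped."""
    changes = 0
    prev_positive = None
    for i in range(steps + 1):
        value = f(lo + (hi - lo) * i / steps)
        if value == 0.0:
            continue
        positive = value > 0.0
        if prev_positive is not None and positive != prev_positive:
            changes += 1
        prev_positive = positive
    return changes

# With t = exp(x): -6 + 11 t - 6 t^2 + t^3 = (t - 1)(t - 2)(t - 3),
# so the zeros are x = 0, ln 2, ln 3.
coeffs = [-6.0, 11.0, -6.0, 1.0]
rates = [0.0, 1.0, 2.0, 3.0]
n_terms = len(coeffs)

changes = count_sign_changes(lambda x: exp_sum(coeffs, rates, x), -1.0, 2.0, 3000)
print(f"{n_terms} exponential terms, {changes} sign changes")
```

This four-term sum attains the maximum of three zeros, so it cannot follow a target that alternates in sign four or more times, which is the obstruction the caption describes.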

Theorems & Definitions (32)

  • Lemma 2: UAP of FNNs (Leshno et al., 1993)
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Theorem 6
  • Theorem 7
  • Theorem 8: Informal Version
  • Remark 9
  • Lemma \ref{lem:trans_as_NN}
  • proof
  • ...and 22 more