Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding
Qian Ma, Ruoxiang Xu, Yongqiang Cai
TL;DR
This work analyzes vocabulary in-context learning (VICL), in which the context of a Transformer is built from tokens of a finite vocabulary, and asks when it yields the universal approximation property (UAP). It shows that single-layer Transformers without positional encoding (PE) cannot approximate arbitrary functions when the context tokens come from a finite vocabulary, but that adding PE enables the UAP under density conditions on the combined input set. The authors build a bridge between one-hidden-layer feedforward neural networks (FNNs) and Transformers, leveraging the FNN UAP and a construction based on Kronecker approximation to prove the UAP for VICL with PE, and they also derive non-UAP results for VICL without PE. The findings highlight the pivotal role of positional encoding in enabling in-context learning to act as a universal approximator in discrete-input settings, and offer theoretical guidance for designing PE schemes in practical VICL applications.
Abstract
Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, i.e., vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of ICL.
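
As a rough intuition for why positional encoding matters here, the following minimal sketch illustrates a standard property of softmax attention; it is not the authors' proof or construction. Without positional encoding, the attention output at the query position does not depend on the order of the context tokens, so reordering a finite vocabulary gives the model nothing new to work with. Adding a position-dependent vector to each token breaks that symmetry. The single-head layer, the random weights, the dimensions, and the random positional encoding `pos` are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_at_query(tokens, w_q, w_k, w_v):
    """Single-head softmax attention output at the last (query) position."""
    q = tokens[-1] @ w_q                      # query vector, shape (d,)
    k = tokens @ w_k                          # keys, shape (n, d)
    v = tokens @ w_v                          # values, shape (n, d)
    scores = k @ q / np.sqrt(q.shape[0])      # scaled dot-product scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ v


rng = np.random.default_rng(0)
d, n_context = 4, 5

# Hypothetical random weights, context tokens, and query (illustration only).
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
context = rng.normal(size=(n_context, d))
query = rng.normal(size=(1, d))

# Hypothetical positional encoding: one random vector per position.
pos = rng.normal(size=(n_context + 1, d))

# Reorder the context tokens with a cyclic shift; the query stays last.
shuffled = context[np.array([1, 2, 3, 4, 0])]

original_seq = np.vstack([context, query])
shuffled_seq = np.vstack([shuffled, query])

out_plain = attention_at_query(original_seq, w_q, w_k, w_v)
out_plain_shuffled = attention_at_query(shuffled_seq, w_q, w_k, w_v)
out_pe = attention_at_query(original_seq + pos, w_q, w_k, w_v)
out_pe_shuffled = attention_at_query(shuffled_seq + pos, w_q, w_k, w_v)

# Without PE the two orderings agree; with PE they generally differ.
same_without_pe = bool(np.allclose(out_plain, out_plain_shuffled))
same_with_pe = bool(np.allclose(out_pe, out_pe_shuffled))
print("order-invariant without PE:", same_without_pe)
print("order-invariant with PE:", same_with_pe)
```

The sketch only shows that positional encoding lets the same vocabulary tokens produce different outputs at different positions; the sufficient conditions on the positional encoding that the paper provides are a separate matter and are not modeled here.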
