A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks
Jacob Abernethy, Alekh Agarwal, Teodor V. Marinov, Manfred K. Warmuth
TL;DR
This work analyzes the in-context learning capability of fixed transformers by proposing a mechanism that first segments an in-context prompt into (x_i, y_i) examples, then learns a sparse linear regressor from those examples, and finally applies the learned hypothesis to new queries without parameter updates. It provides formal segmentation and hypothesis-learning procedures, each with explicit sample complexity guarantees, and connects them to the transformer's attention structure. Empirically, the authors validate the 1-sparse tokenized regression setting, showing that a small number of in-context examples suffices to identify the correct coordinate, with attention patterns mirroring the theoretical steps. The findings illuminate the mechanisms by which ICL arises in sparse retrieval tasks and highlight the importance of delimiter design and pre-training priors for sample-efficient learning.
Abstract
We study the phenomenon of in-context learning (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a mechanism whereby a transformer can achieve the following: (a) receive an i.i.d. sequence of examples which have been converted into a prompt using potentially ambiguous delimiters, (b) correctly segment the prompt into examples and labels, (c) infer from the data a sparse linear regressor hypothesis, and finally (d) apply this hypothesis on the given test example and return a predicted label. We establish that this entire procedure is implementable using the transformer mechanism, and we give sample complexity guarantees for this learning framework. Our empirical findings validate the challenge of segmentation, and we show a correspondence between our posited mechanisms and observed attention maps for step (c).
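To make steps (c) and (d) concrete, the following is a minimal NumPy sketch of the 1-sparse case, in which the label is assumed to equal a single unknown coordinate of the input. It is not the transformer construction analyzed in the paper: it assumes the prompt has already been segmented into (x_i, y_i) pairs, that labels are noiseless, and that inputs are Gaussian. The names `fit_one_sparse`, `predict`, and `demo` are hypothetical and introduced only for illustration.

```python
import numpy as np


def fit_one_sparse(X, y):
    """Step (c): pick the coordinate j whose hypothesis h_j(x) = x[j]
    has the smallest squared error on the in-context examples."""
    # losses[j] = sum_i (X[i, j] - y[i])^2, one entry per candidate coordinate.
    losses = ((X - y[:, None]) ** 2).sum(axis=0)
    return int(np.argmin(losses))


def predict(j, x_query):
    """Step (d): apply the learned hypothesis to a new query."""
    return float(x_query[j])


def demo(num_examples=8, dim=20, seed=0):
    """Illustrative setup (an assumption, not the paper's experiment):
    Gaussian inputs and noiseless labels y_i = x_i[true_j]."""
    rng = np.random.default_rng(seed)
    true_j = int(rng.integers(dim))
    X = rng.standard_normal((num_examples, dim))
    y = X[:, true_j]
    j_hat = fit_one_sparse(X, y)
    x_query = rng.standard_normal(dim)
    return true_j, j_hat, predict(j_hat, x_query), float(x_query[true_j])
```

In this noiseless sketch the true coordinate incurs zero loss, so it is the one selected, and the prediction on the query matches the target. No model parameters are updated at any point; the hypothesis is inferred purely from the examples, which is the sense in which the procedure is "in-context".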
