A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks

Jacob Abernethy; Alekh Agarwal; Teodor V. Marinov; Manfred K. Warmuth

TL;DR

This work analyzes the in-context learning capability of fixed transformers by proposing a mechanism that first segments an in-context prompt into (x_i, y_i) examples, then learns a sparse linear regressor from those examples, and finally applies the learned hypothesis to new queries without parameter updates. It provides formal segmentation and hypothesis-learning procedures, each with explicit sample complexity guarantees, and connects them to the transformer's attention structure. Empirically, the authors validate the 1-sparse tokenized regression setting, showing that a small number of in-context examples suffices to identify the correct coordinate, with attention patterns mirroring the theoretical steps. The findings illuminate the mechanisms by which ICL arises in sparse retrieval tasks and highlight the importance of delimiter design and pre-training priors for sample-efficient learning.

Abstract

We study the phenomenon of in-context learning (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a mechanism whereby a transformer can achieve the following: (a) receive an i.i.d. sequence of examples which have been converted into a prompt using potentially ambiguous delimiters, (b) correctly segment the prompt into examples and labels, (c) infer from the data a sparse linear regressor hypothesis, and finally (d) apply this hypothesis on the given test example and return a predicted label. We establish that this entire procedure is implementable using the transformer mechanism, and we give sample complexity guarantees for this learning framework. Our empirical findings validate the challenge of segmentation, and we show a correspondence between our posited mechanisms and observed attention maps for step (c).

Paper Structure (39 sections, 22 theorems, 63 equations, 9 figures)

Introduction
Problem Setting and Notation
The Transformer Architecture
In-Context Learning
An Overview of Results
Segmenting an input sequence
Learning a consistent hypothesis
Segmenting an ICL Instance
Implementation using a transformer.
Sample complexity of segmentation.
Learning a Consistent Hypothesis for Tokenized Sparse Regression
Empirical Results
Conclusion
Related Work
Notation
...and 24 more sections

Key Result

Theorem 1

There exists a transformer with $O(1)$ layers and $O(\mathcal{V}_{\texttt{delims}} \times \mathcal{V}_{\texttt{delims}})$ heads per layer which computes $\widehat{\texttt{<lsep>}}, \widehat{\texttt{<esep>}}$ according ...
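To make the segmentation problem concrete, the following is a minimal sketch that enumerates ordered pairs of candidate delimiters and keeps those that split a prompt into well-formed (input, label) examples. It is only an illustration under stated assumptions: the validity rule used here (each example contains the label separator exactly once) is a simple stand-in and not the criterion of the theorem, and the function names, the candidate delimiter list, and the prompt strings are all hypothetical. The enumeration over pairs loosely mirrors the quadratic dependence on the delimiter vocabulary in the head count above.

```python
from itertools import permutations


def segment(prompt, lsep, esep):
    """Split `prompt` into (x, y) pairs, or return None if the pair
    (lsep, esep) does not yield well-formed examples.

    Assumed validity rule (illustrative only): every example must
    contain `lsep` exactly once, with non-empty text on both sides.
    """
    pieces = [p for p in prompt.split(esep) if p.strip()]
    pairs = []
    for piece in pieces:
        if piece.count(lsep) != 1:
            return None
        x, y = piece.split(lsep)
        if not x.strip() or not y.strip():
            return None
        pairs.append((x.strip(), y.strip()))
    return pairs or None


def candidate_segmentations(prompt, delims):
    """Try every ordered (lsep, esep) pair and keep the valid ones."""
    valid = {}
    for lsep, esep in permutations(delims, 2):
        pairs = segment(prompt, lsep, esep)
        if pairs is not None:
            valid[(lsep, esep)] = pairs
    return valid


# Hypothetical prompts in the style of a quote/play task.
delims = ["/", ";", ","]
prompt_one = "To be, or not to be / Hamlet;"
prompt_two = prompt_one + " All the world's a stage / As You Like It;"

one = candidate_segmentations(prompt_one, delims)
two = candidate_segmentations(prompt_two, delims)
print(sorted(one))  # delimiter pairs still plausible after one example
print(sorted(two))  # delimiter pairs still plausible after two examples
```

In this toy run, the comma inside the first quote leaves two delimiter pairs plausible after a single example, and the second example rules out the spurious one. This is the flavor of ambiguity that a sample complexity bound for segmentation has to account for.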

Figures (9)

Figure 1: An example of an ICL task: quotes from Shakespeare followed by the name of the play. Correct delimiters are $(\texttt{<lsep>}, \texttt{<esep>}) = (\texttt{/}, \texttt{;})$, yet the presence of other potential delimiters creates ambiguity.
Figure 2: A transformer for $1$-sparse tokenized regression with $n=2$ examples and $m=3$ tokens per example. The curved lines show attentions, with heights proportional to the attention. The blue and red attention lines show the attentions of $y_1$ and $y_2$ over the previous tokens. The green attention lines show the attentions of $x_{2,2}$ and $x_{2,3}$ over the previous tokens. In this case, $f^\star=2$. After the first example, there is ambiguity between $f_1 \in \{2,3\}$, hence the output $f_1(x_2)$ mixes these and is not correct. After the second example, the answer is uniquely determined for inference on the third example and beyond. In the first layer, each $y_i$ attends to tokens $x_{i,j}$ from example $i$ to find all consistent hypotheses in example $i$. By attending across previous $y_t$'s, each $y_i$ aggregates these hypotheses over all preceding inputs $t \leq i$. Example $i+1$ then attends to $y_i$ to predict using the aggregated hypothesis in the final two layers.
Figure 3: Loss and attention plots for $1$-sparse tokenized regression for Gaussian (top) and Rademacher (bottom) inputs. Loss drops to zero as soon as $f^\star$ is determined, and attentions follow the construction of Section \ref{sec:hyp-learning}. Indices $4, 10, 16,\ldots$ are tokens where the label is predicted. In panels (b) and (e), these indices attend to the index of $f^\star$ in $x_i$ to predict $y_i$ correctly. The target indices line (blue) in panel (b) perfectly overlaps with the attention spikes at tokens $x_{i,0}$. In panel (d), the attention spikes largely overlap with target indices, but there is some noise (see text). In panels (c) and (f), these indices attend to all previous labels (indices $5, 11, 17,\ldots$) to aggregate a consistent hypothesis across previous examples.
Figure 4: Train Gaussian, inference Gaussian.
Figure 5: Train Gaussian, inference Rademacher.
...and 4 more figures
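
The hypothesis-learning step described in the Figure 2 caption can be sketched outside of a transformer as a plain set intersection. The code below is a minimal illustration, assuming that each label equals one fixed coordinate of its input ($y_i = x_{i,f^\star}$) and that an ambiguous prediction averages over the remaining consistent coordinates. The averaging rule, the function names, and the toy data are assumptions made for this sketch, not details taken from the paper. Indices are 0-based here, so the caption's $f^\star = 2$ corresponds to index 1.

```python
import numpy as np


def consistent_coordinates(x, y):
    """Coordinates j of one example with x[j] == y (up to float tolerance)."""
    return set(np.flatnonzero(np.isclose(x, y)).tolist())


def aggregate_hypotheses(xs, ys):
    """Intersect per-example consistent sets over the examples seen so far.

    Returns a list whose t-th entry holds the coordinates consistent
    with examples 0..t.
    """
    candidates = set(range(xs.shape[1]))
    history = []
    for x, y in zip(xs, ys):
        candidates &= consistent_coordinates(x, y)
        history.append(sorted(candidates))
    return history


def predict(x_query, candidates):
    """Average the query over the surviving coordinates.

    Assumption: "mixing" ambiguous hypotheses is modeled as a plain mean.
    """
    return float(np.mean(x_query[list(candidates)]))


# Toy Rademacher-style data with m = 3 tokens and true coordinate index 1.
xs = np.array([[1.0, -1.0, -1.0],
               [1.0, 1.0, -1.0]])
ys = xs[:, 1]

history = aggregate_hypotheses(xs, ys)
print(history)  # consistent coordinates after each example

x_query = np.array([-1.0, 1.0, 1.0])
print(predict(x_query, history[-1]))  # prediction from the final hypothesis set
```

After the first example two coordinates remain consistent, so a prediction at that point would blend them. The second example removes the ambiguity, matching the behavior described for Figure 2.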

Theorems & Definitions (38)

Definition 1
Definition 2
Theorem 1: Transformers can segment
Theorem 2: Sample complexity of segmentation, informal
Definition 3: Tokenized sparse regression
Theorem 3: Transformers find a consistent hypothesis
Theorem 4: Sample complexity of hypothesis learning, informal
Theorem 5
Lemma 1
Lemma 2
...and 28 more
