Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data

Yue Xing; Xiaofeng Lin; Chenheng Xu; Namjoon Suh; Qifan Song; Guang Cheng

TL;DR

This paper analyzes how transformers perform in-context learning (ICL) when the demonstrations appear in the prompt as unstructured data, with each $x_i$ and $y_i$ stored in separate tokens, rather than in the structured format where each pair shares a single token. It shows that a two-layer transformer with a look-ahead attention mask can learn from unstructured prompts, while a single-layer model fails; positional encoding and a larger embedding dimension further improve the matching between input and target tokens, enabling more accurate predictions. The theorems and supporting simulations show how $x_i$ and $y_i$ can be matched in the first layer and how the second layer can then predict $y_q$, with error diminishing as the prompt length grows under certain architectural choices. These results clarify which architectural ingredients ICL on unstructured data relies on and offer guidance for prompt design and transformer design.
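To make the distinction between structured and unstructured prompts concrete, here is a minimal NumPy sketch for a linear regression task. The function name `make_prompts`, the noiseless labels, the zero padding, and the exact column layout are assumptions made for illustration; the paper's own prompt formats may differ in detail. The token dimension $p=d+1$ follows the statement of Theorem 4.

```python
import numpy as np

def make_prompts(d=3, D=4, seed=0):
    """Build one structured and one unstructured prompt (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)        # task vector, redrawn for every prompt
    X = rng.normal(size=(D, d))       # demonstration inputs x_1, ..., x_D
    y = X @ theta                     # assumed noiseless labels y_i = x_i^T theta
    x_q = rng.normal(size=d)          # query input
    y_q = float(x_q @ theta)          # target the model should predict

    # Structured: each column is one token holding both x_i and y_i.
    structured = np.zeros((d + 1, D + 1))
    structured[:d, :D] = X.T
    structured[d, :D] = y
    structured[:d, D] = x_q           # query token; its label slot stays 0

    # Unstructured: x_i and y_i occupy separate, adjacent tokens.
    unstructured = np.zeros((d + 1, 2 * D + 1))
    unstructured[:d, 0:2 * D:2] = X.T  # even columns carry x_i
    unstructured[d, 1:2 * D:2] = y     # odd columns carry y_i
    unstructured[:d, 2 * D] = x_q      # query token comes last
    return structured, unstructured, y_q

structured, unstructured, y_q = make_prompts()
print(structured.shape, unstructured.shape)
```

In the unstructured layout no single token contains both $x_i$ and $y_i$, so the model has to pair them up itself before it can run any regression-like computation.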

Abstract

Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume that the input $x_i$ and the output $y_i$ of each demonstration example are in the same token (i.e., structured data). In practice, however, the examples are usually text input, and all words, regardless of their logical relationship, are stored in different tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical explanation for the success of the architecture. In particular, we consider a simple transformer with one or two attention layers and linear regression tasks for the ICL prediction. We observe that (1) a transformer with two layers of (self-)attention and a look-ahead attention mask can learn from prompts with unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve better ICL performance.
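The attention layer and mask mentioned in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's exact architecture: the "look-ahead attention mask" is read here as a causal mask (token $i$ attends only to tokens $j\le i$), tokens are stored as columns, and the residual connection and the names `softmax_attention`, `W_kq`, and `W_v` are hypothetical choices.

```python
import numpy as np

def softmax_attention(E, W_kq, W_v, causal=True):
    """One softmax self-attention layer on a prompt E whose columns are tokens."""
    n = E.shape[1]
    scores = E.T @ W_kq @ E                  # scores[i, j]: query token i, key token j
    if causal:
        # Assumed mask: hide future tokens, so token i sees only j <= i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # each row sums to 1
    # Assumed residual form: output token i = E[:, i] + sum_j weights[i, j] * (W_v E)[:, j]
    return E + (W_v @ E) @ weights.T, weights

rng = np.random.default_rng(0)
p, n = 4, 9                                  # token dimension and prompt length
E = rng.normal(size=(p, n))
W_kq, W_v = np.eye(p), np.eye(p)             # placeholder weights, not trained values

H1, A1 = softmax_attention(E, W_kq, W_v)     # first layer
H2, A2 = softmax_attention(H1, W_kq, W_v)    # second layer stacked on the first
print(np.round(A1[:3, :3], 2))
```

Stacking two such layers mirrors the two-layer setting studied in the paper: the first layer can mix information between neighboring tokens, and the second layer then operates on the mixed tokens.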

Paper Structure (32 sections, 7 theorems, 86 equations, 9 figures, 3 tables)

Introduction
Separating columns of $x_i$ and $y_i$
Other components of transformers
Related Works
Theoretical studies
Transformer Architecture
Notations and Architecture
Data
Transformer architecture
Experiment Settings
Two-Layer Transformer + Attention Mask
Empirical Observation
Why One-Layer Transformer Fails
Why Two-Layer Transformer Succeeds
Positional Encoding
...and 17 more sections

Key Result

Theorem 1

Consider a transformer with one layer of softmax attention. Assume there is no $W_{in}$ and no PE. Assume $W_{KQ,1}$ is in the form of … for some $A\in\mathbb{R}^{d\times d}$ and $b\in\mathbb{R}^d$ such that $\|A\|$ and $\|b\|$ are both bounded. If $D\rightarrow\infty$, then regardless of whether the transformer uses the attention mask, the optimal solution of $b$ satisfies $\|b\|^2\rightarrow$ …

Figures (9)

Figure 1: ICL performance with different number of layers, mask, position. The two-layer structure and the attention mask are crucial, while positional encoding significantly improves performance.
Figure 2: Performance of one-layer and two-layer transformers with/without attention mask (no PE).
Figure 3: ICL prediction performance of training two layers vs training the second layer only.
Figure 4: Attention scores of the first 5 input pairs on a single head, two layers, with mask, $E_2$ format. One prompt. Each row is the attention of one token. No PE.
Figure 5: Attention scores of the first 5 input pairs on a single head, two layers, with mask, $E_2$ format. One prompt. Each row is the attention of one token. With PE.
...and 4 more figures

Theorems & Definitions (14)

Theorem 1: One layer attention is not sufficient
Theorem 2: Two layers with attention mask facilitate ICL
Theorem 3: First layer without embedding matrix
Theorem 4: PE is not flexible when $D\gg p=d+1$
Theorem 5: First layer with embedding matrix
Theorem 6: ICL performance with embedding matrix
Remark 1
Proposition 1
Proof of Theorem \ref{thm:one_layer}
Proof of Theorem \ref{thm:two_layer_lazy}
...and 4 more
