How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations

Tianyu Guo; Wei Hu; Song Mei; Huan Wang; Caiming Xiong; Silvio Savarese; Yu Bai

TL;DR

This work studies in-context learning (ICL) when task labels depend on inputs through a fixed representation $\Phi^\star$ followed by a linear readout that varies across tasks, a more realistic setting than simple function classes. It gives a constructive theory showing that decoder transformers can implement in-context ridge regression on the representations with mild depth, and validates this empirically on synthetic data, observing a clear division in which lower layers compute $\Phi^\star(\mathbf{x})$ and upper layers perform linear ICL. The paper also develops probing and pasting techniques that reveal mechanisms such as copying of inputs and representations and the ability of the upper module to carry out linear ICL on its own, including in mixture-representation scenarios. These results offer mechanistic insight into how transformers could realize ICL in more complex, representation-based tasks and lay groundwork for extending the analysis to real-world representations. The findings may also inform prompt and architecture designs that separate representation learning from in-context adaptation, potentially improving the robustness and interpretability of ICL in large language models.

Abstract

While large language models based on the transformer architecture have demonstrated remarkable in-context learning (ICL) capabilities, understanding of such capabilities is still at an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps toward understanding ICL in more complex scenarios, by studying learning with representations. Concretely, we construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function, composed with a linear function that differs in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find that trained transformers consistently achieve near-optimal ICL performance in this setting, and exhibit the desired dissection where lower layers transform the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a post-ICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios.
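The two-stage algorithm described in the abstract (apply the fixed representation, then run linear ICL on the transformed examples) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the one-layer `tanh` representation, the matrix `B`, the example data, and the normalization of the ridge objective as $\frac{1}{2N}\sum_j (\langle w, \Phi^\star(x_j)\rangle - y_j)^2 + \frac{\lambda}{2}\|w\|^2$ are all assumptions made for the example.

```python
import math

def phi_star(x, B):
    """Hypothetical stand-in for the fixed representation: Phi(x) = tanh(B x)."""
    return [math.tanh(sum(b * xi for b, xi in zip(row, x))) for row in B]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def phi_ridge_predict(xs, ys, x_query, B, lam):
    """Stage 1: map inputs through phi_star. Stage 2: ridge regression on the features."""
    feats = [phi_star(x, B) for x in xs]
    N, D = len(feats), len(B)
    # Normal equations of the assumed objective: (Phi^T Phi / N + lam I) w = Phi^T y / N
    A = [[sum(f[i] * f[j] for f in feats) / N + (lam if i == j else 0.0)
          for j in range(D)] for i in range(D)]
    b = [sum(f[i] * y for f, y in zip(feats, ys)) / N for i in range(D)]
    w_hat = solve(A, b)
    return sum(w * f for w, f in zip(w_hat, phi_star(x_query, B)))

# One in-context task: the representation is shared, the readout w_task is task-specific.
B = [[1.0, 0.0], [0.0, 1.0]]
w_task = [2.0, -1.0]
xs = [[0.5, -1.0], [1.0, 0.3], [-0.7, 0.8], [0.2, 0.2], [-1.2, -0.4]]
ys = [sum(w * f for w, f in zip(w_task, phi_star(x, B))) for x in xs]
x_query = [0.3, -0.6]

pred = phi_ridge_predict(xs, ys, x_query, B, lam=1e-9)
truth = sum(w * f for w, f in zip(w_task, phi_star(x_query, B)))
print(pred, truth)
```

With noiseless labels and a very small `lam`, the prediction should closely match the true label for the query, since the labels are exactly linear in the features. A new task would reuse `B` and change only `w_task`.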

Paper Structure (50 sections, 11 theorems, 76 equations, 32 figures)

Introduction
Related work
In-context learning
In-weights learning versus in-context learning
Mechanistic understanding and probing techniques
Preliminaries
Transformers
In-context learning
In-context learning with representations
Supervised learning with representation
Theory
Proof techniques
Dynamical system with representation
Theory
Experiments
...and 35 more sections

Key Result

Theorem 1

For any representation function $\Phi^\star$ of the MLP form in eqn:mlp, any $\lambda>0$, $B_\Phi,B_w,B_y>0$, $\varepsilon<B_\Phi B_w/2$, letting $\kappa := 1+B_\Phi^2/\lambda$, there exists a transformer ${\rm TF}$ with $L+\mathcal{O}\left(\kappa\log(B_\Phi B_w/\varepsilon)\right)$ layers,
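One way to read the $\mathcal{O}(\kappa\log(B_\Phi B_w/\varepsilon))$ depth term is as an iteration count for gradient descent on the ridge objective. The theorem list for the paper includes a proposition on gradient descent for smooth and strongly convex functions and one on approximating a single GD step by a single attention layer, which suggests this reading, but the sketch below is an illustration under assumptions rather than the paper's construction. It assumes the objective $\frac{1}{2N}\sum_j(\langle w,\phi_j\rangle-y_j)^2+\frac{\lambda}{2}\|w\|^2$, features with $\|\phi_j\|\le B_\Phi$, a ridge solution with norm at most $B_w$, and step size $1/(B_\Phi^2+\lambda)$. The function names and data are hypothetical.

```python
import math

def ridge_gd(feats, ys, lam, b_phi, steps):
    """Plain gradient descent on the assumed ridge objective, starting from w = 0."""
    n, d = len(feats), len(feats[0])
    eta = 1.0 / (b_phi ** 2 + lam)  # 1 / smoothness bound when ||phi_j|| <= b_phi
    w = [0.0] * d
    for _ in range(steps):
        resid = [sum(wi * fi for wi, fi in zip(w, f)) - y for f, y in zip(feats, ys)]
        grad = [sum(r * f[i] for r, f in zip(resid, feats)) / n + lam * w[i]
                for i in range(d)]
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

def steps_for_accuracy(b_phi, b_w, lam, eps):
    """Steps after which the prediction error is at most eps under the stated assumptions."""
    kappa = 1.0 + b_phi ** 2 / lam  # condition number bound of the ridge objective
    return math.ceil(kappa * math.log(b_phi * b_w / eps))

# Hypothetical features (all of norm <= 1) and noiseless labels from one readout.
feats = [[0.6, 0.0], [0.0, 0.8], [0.6, -0.8], [-0.3, 0.4]]
w_task = [1.0, -0.5]
ys = [sum(w * f for w, f in zip(w_task, phi)) for phi in feats]

T = steps_for_accuracy(b_phi=1.0, b_w=2.0, lam=0.1, eps=1e-3)
w_T = ridge_gd(feats, ys, lam=0.1, b_phi=1.0, steps=T)
w_ref = ridge_gd(feats, ys, lam=0.1, b_phi=1.0, steps=20 * T)  # near-exact minimizer

phi_query = [0.6, 0.8]
gap = abs(sum((a - b) * p for a, b, p in zip(w_T, w_ref, phi_query)))
print(T, gap)
```

For a quadratic objective with this step size, each step contracts the distance to the minimizer by a factor of at most $1-1/\kappa$, so the step count grows linearly in $\kappa$ and only logarithmically in $1/\varepsilon$. If each step costs a constant number of layers, this matches the shape of the depth bound.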

Figures (32)

Figure 1: Illustration of our setting and theory
Figure 2: ICL risks
Figure 3: Linear probes
Figure 5: Varying noise level
Figure 6: Varying rep hidden dimension
...and 27 more figures

Theorems & Definitions (20)

Theorem 1: Transformer can implement $\Phi^\star$-Ridge
Theorem 2: Transformer can implement $\Phi^\star$-Ridge for dynamical system
Proposition A.1: Gradient descent for smooth and strongly convex functions
Lemma B.1: Copying by a single attention head
proof
Lemma B.2: Linear prediction layer
proof
Lemma B.3: Implementing MLP representation by transformers
proof
Proposition B.4: Approximating a single GD step by a single attention layer
...and 10 more
