Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Guanyu Chen; Ruichen Wang; Tianren Zhang; Feng Chen

Abstract

In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. Code: https://github.com/McGuinnessChen/dual-representation-space-encoding
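The dual-space idea in the abstract can be illustrated with a toy sketch. This is not the CoQE architecture: it only assumes, for illustration, that a task is a linear functional acting on a 2-D sample representation, so the context is encoded into a task vector and the query into a sample vector, and the prediction is their pairing. The names `encode_sample`, `encode_context`, and `predict` are hypothetical, and the least-squares context encoder stands in for whatever the trained model actually learns.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def encode_sample(x: Sequence[float]) -> Vector:
    """Hypothetical sample encoder: identity features (assumption)."""
    return [float(v) for v in x]


def encode_context(context: Sequence[Tuple[Sequence[float], float]]) -> Vector:
    """Hypothetical context encoder for 2-D sample representations.

    Fits a task vector w by least squares, solving the normal equations
    (sum m m^T) w = sum y m with Cramer's rule.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y in context:
        m = encode_sample(x)
        a11 += m[0] * m[0]
        a12 += m[0] * m[1]
        a22 += m[1] * m[1]
        b1 += y * m[0]
        b2 += y * m[1]
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("context does not determine the task vector")
    return [(b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det]


def predict(task_rep: Vector, sample_rep: Vector) -> float:
    """Dual pairing: the task vector acts as a linear functional on the sample."""
    return sum(w * m for w, m in zip(task_rep, sample_rep))


# Toy task y = 2*x0 - x1, shown only through context examples.
context = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
task_rep = encode_context(context)          # lives in the task space
sample_rep = encode_sample([3.0, 4.0])      # lives in the sample space
prediction = predict(task_rep, sample_rep)
```

Because the two encoders never share a representation, changing the context only moves the task vector, while the query representation stays fixed. That separation is the property the sketch is meant to convey.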

Paper Structure (54 sections, 8 theorems, 48 equations, 12 figures, 4 tables)


Introduction
Related Work
Investigations on ICL Mechanisms.
Relationship between ICL and IWL.
Linearization in Latent Space.
Preliminaries
In-context learning setup.
Transformer model.
Dual space.
Representation Space Analysis
Synthetic Task Setup
Observations and Analysis
Observation 1: ICL corresponds to good representations for context, while IWL corresponds to good representations for samples. The two are difficult to achieve simultaneously.
Observation 2: Model size also affects the ICL-IWL tradeoff.
Dual-Space Modeling of Task and Sample Representations
...and 39 more sections

Key Result

Proposition 5.3

Let $\mathcal{X}$ be the input space and $\mathcal{Y}_f$ the multiple label sets corresponding to each task $f \in \mathcal{F}$. Under Definition 3.2, there exists a linear sample representation space $\mathcal{M}_\mathcal{F}$ and a linear task transformation space $\mathcal{T}$, where $\mathcal{T}$

Figures (12)

Figure 1: (Top left) Overview of our synthetic task setup. (Top right) ICL and IWL performance under different training settings ($E$, $L$, $P_\text{bursty}$, $\alpha$). The Transformers fluctuate between ICL and IWL capabilities, whereas our CoQE models robustly reconcile the two capabilities. (Bottom) Representation visualization of ten classes under distinct context conditions. We observe that good clusters for samples and good clusters for contexts are hard to achieve simultaneously. Detailed discussions are presented in Section \ref{rep_anal}.
Figure 2: (Left) For three training settings and four training checkpoints (3k, 10k, 50k, 100k steps), there are clear positive correlations between ICL performance and CSC, and between IWL performance and SSC. (Right) Through various experiments, we observe that larger $E$ improves IWL but causes ICL to disappear, and larger $L$ consistently enhances ICL while slightly harming IWL. Different colors and line styles represent different training settings. Detailed results are provided in Table \ref{tab:model_performance}.
Figure 3: Comparison of Transformer and CoQE architectures. Unlike the Transformer, which encodes both context-level and sample-level information into the same representation space, CoQE implements dual representation space encoding that explicitly distinguishes between context and samples.
Figure 4: Construction and training of the task representation space for few-shot classification.
Figure 5: Results of ICL regression. We provide optimal baselines for test settings except for combination functions. CoQE consistently achieves lower ICL error than the Transformer in both ID and OOD scenarios.
...and 7 more figures

Theorems & Definitions (18)

Definition 3.1: Dual space
Definition 5.1: Linear sample representation space
Definition 5.2: Linear task transformation space
Proposition 5.3: Task-sample duality
Definition 5.4: Task representation space
Definition 5.5: Basis task representations
Theorem 5.6: Completeness of basis representations under task traversal
Definition 5.7: Context-induced task representation in ICL
Proposition 5.8: Closed form of $\omega_f$ under simplified LSA
Theorem 5.9: Entangled structure under general SA
...and 8 more
