Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen

Abstract

In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. Code: https://github.com/McGuinnessChen/dual-representation-space-encoding
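
The separation described in the abstract can be illustrated with a toy sketch. This is not the CoQE implementation: the function names (`encode_context`, `encode_sample`, `pair`, `predict`) are hypothetical, the "encoders" are trivial stand-ins, and the only idea taken from the abstract is that the context and the query sample are encoded separately and then combined through a linear pairing, with the task representation acting as a linear functional on the sample representation.

```python
# Toy illustration only: hypothetical names, not the authors' CoQE code.
# Assumption: each label in a task is represented by a vector that acts on a
# sample representation as a linear functional (an inner product).

def encode_sample(x):
    """Stand-in sample encoder: maps a query into the sample space (identity here)."""
    return tuple(float(v) for v in x)

def encode_context(context):
    """Stand-in context encoder: builds one functional per label from the
    (input, label) pairs by averaging the inputs that share a label."""
    sums, counts = {}, {}
    for x, label in context:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += float(v)
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def pair(functional, rep):
    """Linear pairing between a task-space vector and a sample-space vector."""
    return sum(w * v for w, v in zip(functional, rep))

def predict(context, query):
    """Encode context and query separately, then combine them only via the pairing."""
    task = encode_context(context)   # task representation space
    rep = encode_sample(query)       # sample representation space
    return max(task, key=lambda label: pair(task[label], rep))

context = [((1.0, 0.0), "a"), ((0.9, 0.1), "a"),
           ((0.0, 1.0), "b"), ((0.1, 0.9), "b")]
print(predict(context, (0.8, 0.2)))
```

In this sketch the two encoders never share intermediate state; the query only meets the context at the final pairing step, which is the structural point the abstract makes about keeping the two spaces distinct.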

Paper Structure (54 sections, 8 theorems, 48 equations, 12 figures, 4 tables)

Key Result

Proposition 5.3

Let $\mathcal{X}$ be the input space and $\mathcal{Y}_f$ the multiple label sets corresponding to each task $f \in \mathcal{F}$. Under Definition 3.2, there exists a linear sample representation space $\mathcal{M}_\mathcal{F}$ and a linear task transformation space $\mathcal{T}$, where $\mathcal{T}$

Figures (12)

  • Figure 1: (Top left) Overview of our synthetic task setup. (Top right) ICL and IWL performance under different training settings ($E$, $L$, $P_\text{bursty}$, $\alpha$). The Transformers fluctuate between ICL and IWL capabilities, whereas our CoQE models robustly reconcile the two capabilities. (Bottom) Representation visualization of ten classes under distinct context conditions. We observe that good clusters for samples and good clusters for contexts are hard to achieve simultaneously. Detailed discussions are presented in Section \ref{rep_anal}.
  • Figure 2: (Left) For three training settings and four training checkpoints (3k, 10k, 50k, 100k steps), there are clear positive correlations between ICL performance and CSC, and between IWL performance and SSC. (Right) Through various experiments, we observe that larger $E$ improves IWL but causes ICL to disappear, and larger $L$ consistently enhances ICL while slightly harming IWL. Different colors and line styles represent different training settings. Detailed results are provided in Table \ref{tab:model_performance}.
  • Figure 3: Comparison of Transformer and CoQE architectures. Unlike the Transformer, which encodes both context-level and sample-level information into the same representation space, CoQE implements dual representation space encoding that explicitly distinguishes between context and samples.
  • Figure 4: Construction and training of the task representation space for few-shot classification.
  • Figure 5: Results of ICL regression. We provide optimal baselines for test settings except for combination functions. CoQE consistently achieves lower ICL error than the Transformer in both ID and OOD scenarios.
  • ...and 7 more figures

Theorems & Definitions (18)

  • Definition 3.1: Dual space
  • Definition 5.1: Linear sample representation space
  • Definition 5.2: Linear task transformation space
  • Proposition 5.3: Task-sample duality
  • Definition 5.4: Task representation space
  • Definition 5.5: Basis task representations
  • Theorem 5.6: Completeness of basis representations under task traversal
  • Definition 5.7: Context-induced task representation in ICL
  • Proposition 5.8: Closed form of $\omega_f$ under simplified LSA
  • Theorem 5.9: Entangled structure under general SA
  • ...and 8 more