
Training In-Context and In-Weights Mixtures Via Contrastive Context Sampling

Deeptanshu Malu, Deevyanshu Malu, Aditya Nemiwal, Sunita Sarawagi

Abstract

We investigate training strategies that co-develop in-context learning (ICL) and in-weights learning (IWL), along with the ability to switch between them based on context relevance. Although current LLMs exhibit both modes, standard task-specific fine-tuning often erodes ICL, motivating IC-Train: fine-tuning with in-context examples. Prior work has shown that the emergence of ICL after IC-Train depends on factors such as task diversity and training duration. In this paper we show that the similarity structure between target inputs and context examples also plays an important role. Random contexts lead to loss of ICL and IWL dominance, while contexts containing only similar examples cause ICL to degenerate into copying labels without regard to relevance. To address this, we propose a simple Contrastive-Context strategy that enforces two types of contrast: (1) a mix of similar and random examples within a context, to evolve a correct form of ICL, and (2) varying grades of similarity across contexts, to evolve ICL-IWL mixtures. We provide insight into the importance of such contrast through theoretical analysis of a minimal model, and validate it with extensive empirical evaluation on four LLMs and several tasks. Diagnostic probes confirm that contrasted contexts yield stable ICL-IWL mixtures, avoiding collapse into pure ICL, pure IWL, or copying.
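The two contrasts described above can be sketched as a sampling procedure. The following is a minimal illustration, not the paper's actual implementation: the helper names (`contrastive_context`, `target_sim`) and the specific grid of similarity grades are assumptions for the sketch.

```python
import random

def contrastive_context(target_sim, pool, k=4, p=0.5, rng=None):
    """Within-context contrast: mix similar and random examples.

    target_sim: hypothetical scoring function mapping a pool example to
        its similarity with the target input.
    p: fraction of the k context slots filled with the most similar
        examples; the remaining slots are drawn uniformly at random.
    """
    rng = rng or random.Random(0)
    n_sim = round(p * k)
    # Take the top-n_sim most similar examples from the pool.
    ranked = sorted(pool, key=target_sim, reverse=True)
    similar = ranked[:n_sim]
    # Fill the rest of the context with random, unrelated examples.
    rest = [ex for ex in pool if ex not in similar]
    ctx = similar + rng.sample(rest, k - n_sim)
    rng.shuffle(ctx)
    return ctx

def sample_training_context(target_sim, pool, k=4, rng=None):
    """Across-context contrast: vary the similarity grade per example."""
    rng = rng or random.Random(0)
    p = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0])  # assumed grade grid
    return contrastive_context(target_sim, pool, k=k, p=p, rng=rng)
```

Under this sketch, a single training context mixes relevant and irrelevant demonstrations (so the model must learn when to attend to context), while different training examples see different values of `p` (so the model sees the full spectrum from IWL-favoring to ICL-favoring contexts).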

Paper Structure

This paper contains 55 sections, 23 equations, 32 figures, 3 tables.

Figures (32)

  • Figure 1: Visual summary of the paper's main findings toward our goal of in-weights learning (IWL) of a task while retaining ICL for continual adaptation to new examples. The X-axis is target-to-context similarity: IWL matters on the left, ICL on the right. Standard fine-tuning with zero in-context examples causes a drop in ICL relative to the base model. Fine-tuning with in-context examples (IC-Train) is sensitive to target-context similarity: random contexts lead to a sharp drop in ICL, while similar-only contexts fail to develop IWL and are instead prone to blind copying. Our method, Contrastive-Context, retains both IWL and ICL and teaches the model to switch between them.
  • Figure 2: Schematic of a minimal two-layer transformer with a summarizer and in-weights learner ($\hat{f}$) in layer-1 and a three parameter second layer that implements the ICL-IWL mixtures.
  • Figure 3: Effect of fine-tuning a base model with different strategies (Zero-Context, and IC-Train under Random-Context, Similar-Context, and Contrastive-Context) on accuracy over varying grades of similarity to in-context examples, for 32 different models, language pairs, and test sets. Remaining plots appear in Appendix Figure \ref{sec:appendix:fig1}. X-axis: maximum similarity of the target to the in-context examples, binned as Low: $0$-$0.33$, Medium: $0.33$-$0.67$, High: $0.67$-$1$. Y-axis: accuracy (COMET score). Main observations: Contrastive-Context is among the most accurate across the entire spectrum of target-context relatedness. On targets with high context similarity, the model fine-tuned with Zero-Context is worse than the baseline, and Random-Context is even worse than Zero-Context. On targets with low context similarity, Similar-Context is worse than both Zero-Context and Random-Context.
  • Figure 4: Emergence of different forms of learning under three training methods: Random-Context, Similar-Context, and Contrastive-Context. X-axis: training steps; Y-axis: score of one of the three probes. Results for other model-task pairs and datasets appear in Appendix Figure \ref{sec:appendix:fig3}. The IWL-score of Similar-Context is lowest, the ICL-score of Random-Context diminishes quickly with training, and the Copy-score of Similar-Context is highest. Contrastive-Context best retains both ICL and IWL capabilities without resorting to copying.
  • Figure 5: Ablations showing (a) the robustness of Contrastive-Context to varying non-zero $\epsilon$ and $p$ values, (b) various paraphrasing models, and (c) the importance of various levels of similarity in the training data. X-axis: target-context similarity; Y-axis: accuracy.
  • ...and 27 more figures