Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study

Xingxuan Zhang, Haoran Wang, Jiansheng Li, Yuan Xue, Shikai Guan, Renzhe Xu, Hao Zou, Han Yu, Peng Cui

TL;DR

The paper probes how transformers generalize through in-context learning by introducing a task-centric framework with inter-problem, intra-problem, and intra-task axes. Through function-fitting and real-world experiments (tool-calling and translation), it demonstrates strong intra-problem and intra-task generalization but a lack of inter-problem generalization; diverse, mixed-task training can meaningfully improve generalization to unseen tasks and even boost performance on simple tasks. Finetuned large models show similar intra-task benefits but limited cross-domain transfer unless composition data is included during finetuning. The findings emphasize designing training data to cover diverse tasks and compositions to unlock the full potential of ICL in practical transformer deployments.

Abstract

Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of the Transformer architecture to learn on the fly from limited examples. While ICL underpins many LLM applications, its full potential remains hindered by a limited understanding of its generalization boundaries and vulnerabilities. We present a systematic investigation of transformers' generalization capability with ICL relative to training data coverage by defining a task-centric framework along three dimensions: inter-problem, intra-problem, and intra-task generalization. Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. Including a greater variety of mixed tasks in the training data significantly enhances the generalization ability of ICL on unseen tasks and even on known simple tasks. This guides us in designing training data to maximize the diversity of tasks covered and to combine different tasks whenever possible, rather than focusing solely on the target task for testing.

Paper Structure

This paper contains 47 sections, 8 equations, 25 figures, and 19 tables.

Figures (25)

  • Figure 1: Illustration of inter-problem, intra-problem, and intra-task generalization. As an example, we use single-function fitting as the base problem $\mathcal{T}_{base}$, and addition and multiplication as the combination problems $\mathcal{T}_{com}$.
  • Figure 2: Function curves on the convex combinations of base functions fitted by the Baseline and ComFuncLearner models after ICL. Com stands for combination and the weights of base functions are randomly sampled and labeled on the figure.
  • Figure 3: Function curves on the product combinations of base functions fitted by the Baseline and ComFuncLearner models after ICL.
  • Figure 4: Function curves on the compositional combinations of base functions fitted by the Baseline and ComFuncLearner models after ICL.
  • Figure 5: The fitted curves of LLaMA-3 on composition combinations of learned functions (the absolute value function $-\mathbf{|x|}$ and the quadratic function $\mathbf{x^2}$).
  • ...and 20 more figures
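
The captions for Figures 2 to 5 refer to three ways of combining base functions: convex combination, product, and composition. The sketch below is only an illustration of those three operations and of how a function-fitting ICL prompt could be assembled from (x, f(x)) pairs. The helper names (`convex_combination`, `product_combination`, `composition`, `make_icl_prompt`), the sampling range, and the number of examples are assumptions for illustration, not the authors' implementation; the two base functions are taken from the Figure 5 caption.

```python
import random

# Base functions named in the Figure 5 caption: -|x| and x^2.
def neg_abs(x):
    return -abs(x)

def square(x):
    return x * x

def convex_combination(funcs, weights):
    """Weighted sum of functions; weights are non-negative and sum to 1."""
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return lambda x: sum(w * f(x) for w, f in zip(weights, funcs))

def product_combination(f, g):
    """Pointwise product f(x) * g(x)."""
    return lambda x: f(x) * g(x)

def composition(f, g):
    """Composition f(g(x))."""
    return lambda x: f(g(x))

def make_icl_prompt(func, n_examples, rng, low=-2.0, high=2.0):
    """Hypothetical prompt builder: a list of (x, func(x)) in-context examples.

    The range [low, high] and the uniform sampling are assumptions.
    """
    xs = [rng.uniform(low, high) for _ in range(n_examples)]
    return [(x, func(x)) for x in xs]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
convex = convex_combination([neg_abs, square], [0.3, 0.7])
product = product_combination(neg_abs, square)
composed = composition(neg_abs, square)
prompt = make_icl_prompt(composed, 5, rng)
```

In this framing, a model trained only on prompts built from `neg_abs` or `square` individually would be tested on prompts built from `convex`, `product`, or `composed` to probe generalization to combinations.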

Theorems & Definitions (2)

  • Definition 2.1: In-context Learning Task
  • Definition 2.2: In-context Learning Problem