Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying

TL;DR

This work investigates whether in-context learning (ICL) can generalize to out-of-distribution (OOD) tasks. Using synthetic GPT-2 pretraining on function classes (LR, QR, ReLU NN) and experiments with real LLMs, it shows that ICL largely implements pretraining functions and exhibits retrieval-based behavior for abstract-label tasks under ID conditions. The authors formalize an algorithm-selection mechanism comprising a low-test-error preference and a similar-input-distribution bias, and validate it both theoretically (Gaussian-mixture analysis) and empirically, including real-language experiments that demonstrate prompt-level algorithm selection. Overall, the findings suggest that true OOD generalization via ICL remains challenging for current transformers, with practical ICL performance dominated by within-distribution priors and retrieval strategies rather than novel task acquisition.

Abstract

In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL's well-documented ability to learn unseen abstract labels in context. We demonstrate that this ability only manifests in scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL's performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the low-test-error preference of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.
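The low-test-error preference described in the abstract can be illustrated with a toy selection rule. The sketch below is not the paper's model or code: it assumes a hypothetical, fixed set of three scalar "pretraining functions" (the names and coefficients in `candidates` are invented for illustration) and simply picks the one with the lowest mean squared error on the in-context pairs of a task that lies outside that set.

```python
def context_mse(f, context):
    """Mean squared error of candidate function f on the in-context (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in context) / len(context)


def select_low_error_function(candidates, context):
    """Return the name of the lowest-error candidate and all per-candidate errors."""
    errors = {name: context_mse(f, context) for name, f in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors


# Hypothetical pretraining functions (assumptions made for this illustration).
candidates = {
    "linear": lambda x: 2.0 * x,
    "quadratic": lambda x: x ** 2,
    "relu": lambda x: max(0.0, x),
}

# An OOD task: y = 2x + 0.1, which is not exactly any of the candidates.
context = [(x, 2.0 * x + 0.1) for x in (-1.0, -0.5, 0.5, 1.0, 1.5)]

best, errors = select_low_error_function(candidates, context)
print(best, errors)
```

In this toy setting the OOD task is never learned exactly. The rule falls back on the closest function it already has, which is the behavior the abstract attributes to ICL under multi-task pretraining.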

Paper Structure

This paper contains 38 sections, 5 theorems, 16 equations, 13 figures, 2 tables.

Key Result

Lemma 5.2

(Appendix H.1 of lin2024dual) Consider any two different pretraining components $\alpha$ and $\beta$, given a testing context $\mathcal{S}_T \oplus \boldsymbol{x}_{T+1}$ and the well-pretrained model $M^*$, the ratio between the weights of the two task priors $\tilde{\pi}_\alpha/\tilde{\pi}_\beta$ in Further, assuming the testing in-context examples $\boldsymbol{x}_i\sim \mathcal{N}(\boldsymbol{\mu

Figures (13)

  • Figure 1: The ICL test error of Transformers trained on different function classes (solid lines) and the performance of the models from the corresponding pretraining functions classes trained by gradient descent (GD) using the in-context examples (dashed lines). Y-axis: test square error. X-axis: context length. In all evaluation tasks, we observe that as the test context length increases, the ICL performance of the Transformer pretrained on a particular function class closely approaches that of the model from this function class trained by GD.
  • Figure 2: The top-1 accuracy of predicting the reversed query word (blue) and predicting the reversed target label word (orange). The accuracy of predicting the reversed query word is higher than outputting the reversed target, indicating ICL makes ID predictions.
  • Figure 3: The ICL test error of Transformers trained on the retrieval task with different numbers of label tokens. "Eval" denotes "evaluated on". Note that the indices of training label tokens $I_{y_i}\in [50, 455)$, so the labels in (a) are ID while (b) and (c) are OOD.
  • Figure 4: The ICL test error of Transformers trained and tested on the linear regression + retrieval task with different numbers of label tokens. "Eval" denotes "evaluated on". Only the model trained on the largest number of tasks exhibits generalization to unseen label tokens.
  • Figure 5: The ICL test error of Transformers evaluated on a quadratic regression + retrieval task. Different colors denote models trained on the linear regression + retrieval task with different numbers of label tokens. "Eval" denotes "evaluation". The model trained on $s\sim \mathcal{U}(100, 2000)$ doesn't generalize better than the other two models.
  • ...and 8 more figures
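The GD baseline in the Figure 1 caption (a model from the pretraining function class, trained by gradient descent on the in-context examples) can be sketched for the linear-regression case. This is a minimal illustration under stated assumptions, not the authors' experimental code: the dimension, learning rate, step count, noise-free Gaussian data, and the helper names `gd_linear_fit` and `gd_error_by_context_length` are all choices made here.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def gd_linear_fit(xs, ys, lr=0.05, steps=3000):
    """Fit y ~ w.x by plain gradient descent on the mean squared error."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            residual = dot(w, x) - y
            for j in range(d):
                grad[j] += 2.0 * residual * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def gd_error_by_context_length(lengths, d=3, n_queries=50, seed=0):
    """Test squared error of the GD-fitted linear model for each context length."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0, 1) for _ in range(d)]
    queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_queries)]
    errors = {}
    for n in lengths:
        xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
        ys = [dot(w_true, x) for x in xs]  # noise-free linear task
        w = gd_linear_fit(xs, ys)
        errors[n] = sum((dot(w, q) - dot(w_true, q)) ** 2 for q in queries) / n_queries
    return errors


errors = gd_error_by_context_length([2, 5, 20])
print(errors)
```

Plotting `errors` against context length gives the kind of dashed reference curve the caption describes. Comparing it with a pretrained Transformer's ICL error would require the trained model itself, which this sketch does not include.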

Theorems & Definitions (6)

  • Lemma 5.2
  • Theorem 5.3
  • Lemma D.1
  • Lemma E.1
  • Theorem E.3
  • proof