Can In-context Learning Really Generalize to Out-of-distribution Tasks?
Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying
TL;DR
This work investigates whether in-context learning (ICL) can generalize to out-of-distribution (OOD) tasks. Using GPT-2 models pretrained on synthetic function classes (linear regression, quadratic regression, and ReLU neural networks) together with experiments on real LLMs, it shows that ICL largely implements functions from the pretraining distribution, and that its retrieval-like ability to handle abstract-label tasks appears only under in-distribution (ID) conditions. The authors formalize an algorithm-selection mechanism with two components, a low-test-error preference and a similar-input-distribution bias, and validate it theoretically (through a Gaussian-mixture analysis) and empirically, including real-language experiments that demonstrate algorithm selection at the prompt level. Overall, the findings suggest that true OOD generalization via ICL remains challenging for current Transformers: practical ICL performance is dominated by within-distribution priors and retrieval strategies rather than by the acquisition of novel tasks.
Abstract
In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To this end, we conduct synthetic experiments in which a GPT-2 model must learn OOD mathematical functions through ICL. We find that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent on the in-context examples. Additionally, we investigate ICL's well-documented ability to learn unseen abstract labels in context. We demonstrate that this ability manifests only in scenarios without distributional shift and therefore may not serve as evidence of an ability to learn new tasks. Furthermore, we assess ICL's performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of a low-test-error preference in ICL: the model tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism by which ICL addresses OOD tasks.
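The two behaviors described in the abstract can be illustrated with a small numerical sketch. It is a toy stand-in, not the paper's GPT-2 setup: the "pretraining hypothesis space" is assumed to be a fixed set of function families, each fitted to the in-context examples by plain gradient descent, and the low-test-error preference is approximated by choosing the family with the lowest error on the context. The names `fit_gd`, `FAMILIES`, and `select_by_context_error`, along with the dimensions and sample sizes, are hypothetical choices made for illustration.

```python
import numpy as np

def fit_gd(features, y, steps=5000):
    """Least-squares fit by plain gradient descent on the context examples."""
    n = len(y)
    # Step size 1/L, where L is the largest eigenvalue of the MSE Hessian.
    lipschitz = 2.0 * np.linalg.norm(features, 2) ** 2 / n
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        grad = 2.0 * features.T @ (features @ w - y) / n
        w -= grad / lipschitz
    return w

# Assumed toy hypothesis space: each family maps inputs to features,
# and the prediction is a weighted sum of those features.
FAMILIES = {
    "linear": lambda x: x,          # y = w^T x
    "quadratic": lambda x: x ** 2,  # y = w^T (x * x), elementwise square
}

def mse(feat, w, x, y):
    """Mean squared error of the fitted family on the points (x, y)."""
    return float(np.mean((feat(x) @ w - y) ** 2))

def select_by_context_error(x_ctx, y_ctx, families):
    """Fit every family on the context and return the one with the lowest error."""
    best = None
    for name, feat in families.items():
        w = fit_gd(feat(x_ctx), y_ctx)
        err = mse(feat, w, x_ctx, y_ctx)
        if best is None or err < best[0]:
            best = (err, name, w)
    return best[1], best[2], best[0]

rng = np.random.default_rng(0)
d, n_ctx, n_query = 5, 40, 200
w_true = np.ones(d)
x_ctx = rng.normal(size=(n_ctx, d))
x_query = rng.normal(size=(n_query, d))

# In-distribution: the hypothesis space is linear and so is the task.
w_id = fit_gd(x_ctx, x_ctx @ w_true)
id_err = mse(FAMILIES["linear"], w_id, x_query, x_query @ w_true)

# Out-of-distribution: the hypothesis space is still linear, the task is quadratic.
w_ood = fit_gd(x_ctx, (x_ctx ** 2) @ w_true)
ood_err = mse(FAMILIES["linear"], w_ood, x_query, (x_query ** 2) @ w_true)

# Multiple pretraining families: pick whichever explains the context best.
chosen, _, _ = select_by_context_error(x_ctx, (x_ctx ** 2) @ w_true, FAMILIES)

print(f"ID query error (linear fit, linear task):     {id_err:.2e}")
print(f"OOD query error (linear fit, quadratic task): {ood_err:.2e}")
print(f"Family selected for the quadratic task:       {chosen}")
```

In this sketch a learner restricted to linear functions fits a linear task almost exactly but keeps a large query error on the quadratic task, however well gradient descent converges. Once the quadratic family is part of the assumed hypothesis space, the selection step picks it, which mirrors the preference for the pretraining function with low error in the test context.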
