Analyzing limits for in-context learning
Omar Naim, Jerome Bolte, Nicholas Asher
TL;DR
This work interrogates the claim that transformer-based in-context learning (ICL) implements classical learning algorithms with robust out-of-distribution generalization. Through a controlled polynomial-function task and formal analysis, it shows that ICL performance is constrained by the architecture, particularly attention and Layer Normalization, so that models interpolate within the training distribution but extrapolate poorly. Empirically, two-layer attention-only models suffice for in-distribution learning, but generalization deteriorates under distribution shifts and across polynomial degrees; theoretically, the authors prove that attention-only transformers cannot learn linear functions on significantly out-of-distribution inputs. These findings challenge algorithmic interpretations of ICL and point to fundamental architectural limits, with implications for how we train and design contextual learning systems.
Abstract
Our paper challenges claims from prior research that transformer-based models, when learning in context, implicitly implement standard learning algorithms. We present empirical evidence inconsistent with this view and provide a mathematical analysis demonstrating that transformers cannot achieve general predictive accuracy due to inherent architectural limitations.
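As a rough illustration of how a single architectural component can bound a model's outputs, the sketch below applies a plain layer normalization (no learned gain or bias, which is an assumption made here for simplicity) to the same vector at increasing scales. It is not the authors' experimental setup or proof. The function name `layer_norm`, the dimension 16, and the scale values are all chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learned gain or bias
    (an illustrative simplification, not the paper's exact model)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # hypothetical 16-dimensional hidden vector

# Scale the input far beyond its original range and record the output norm.
scales = (1.0, 10.0, 1000.0)
norms = {s: float(np.linalg.norm(layer_norm(s * x))) for s in scales}

for s, n in norms.items():
    print(f"input scale {s:>7}: output norm {n:.4f}")
```

However large the input becomes, the normalized vector keeps a norm of at most the square root of its dimension, and its direction is essentially unchanged. Information about input magnitude is therefore discarded at this step, which is one intuition for why predictions may saturate on inputs far outside the training range.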
