Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability
Ivan Lee, Nan Jiang, Taylor Berg-Kirkpatrick
TL;DR
This paper asks whether attention is strictly required for in-context learning (ICL) by systematically evaluating 13 model architectures, each trained from scratch, on a suite of synthetic ICL tasks. It finds that every architecture considered can perform ICL, and that attention-based models do not consistently outperform attention alternatives such as state-space and recurrent designs. The study highlights differences in statistical efficiency, consistency, and tendency to memorize across architectures, and shows that prompt length and task difficulty strongly affect performance, challenging the notion that transformers are uniquely suited to ICL. It also extends the analysis to data distribution effects (burstiness) and to prompts that resemble real-world inputs, offering guidance on when and how different architectures can learn from in-context cues without gradient updates.
Abstract
What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state-space-model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with or better in-context learners than transformers. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.
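To make the phrase "synthetic in-context learning task" concrete, here is a minimal sketch of how such a prompt could be built and scored. It is an illustration under stated assumptions, not the paper's setup: the abstract does not name its tasks, so the noisy one-dimensional linear task, the function names (`make_icl_prompt`, `least_squares_predict`, `mean_squared_error`), and the least-squares reference predictor are all hypothetical. The point it illustrates is that the task parameter is resampled for every prompt, so a predictor can only do well by using the in-context examples, and its error can be tracked as the number of examples varies.

```python
import random

def make_icl_prompt(num_examples, rng, noise_std=0.0):
    """Build one synthetic prompt: (x, y) example pairs plus a held-out query.

    Hypothetical task (an assumption, not taken from the paper):
    y = w * x with a fresh w per prompt, so the mapping can only be
    recovered from the in-context examples.
    """
    w = rng.gauss(0.0, 1.0)  # task parameter, resampled for every prompt
    xs = [rng.gauss(0.0, 1.0) for _ in range(num_examples + 1)]
    ys = [w * x + rng.gauss(0.0, noise_std) for x in xs]
    context = list(zip(xs[:-1], ys[:-1]))      # in-context examples
    query_x, target_y = xs[-1], w * xs[-1]     # noise-free target for scoring
    return context, query_x, target_y

def least_squares_predict(context, query_x):
    """Reference predictor: fit w on the context only, then apply it to the query."""
    sxx = sum(x * x for x, _ in context)
    sxy = sum(x * y for x, y in context)
    w_hat = sxy / sxx if sxx > 0 else 0.0
    return w_hat * query_x

def mean_squared_error(num_examples, trials=2000, noise_std=0.5, seed=0):
    """Average squared query error over many freshly sampled prompts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        context, query_x, target_y = make_icl_prompt(num_examples, rng, noise_std)
        total += (least_squares_predict(context, query_x) - target_y) ** 2
    return total / trials
```

Calling `mean_squared_error(k)` for several values of `k` gives an error-versus-examples curve for this reference predictor. In an architecture comparison of the kind the abstract describes, a trained sequence model would take the predictor's place, and the same curve could be extended to values of `k` beyond those seen during training.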
