MLPs Learn In-Context on Regression and Classification Tasks
William L. Tong, Cengiz Pehlevan
TL;DR
The paper investigates whether in-context learning (ICL) can be achieved by models without attention. It shows that vanilla MLPs and MLP-Mixers learn in-context comparably with Transformers under the same compute budget on synthetic regression, classification, and relational tasks. The study uses controlled synthetic tasks that vary context length $L$, data diversity $k$, and input dimension $n$, and compares MLPs, MLP-Mixers, Transformers, and relational bottleneck variants. Key findings: MLPs can approach the Bayes-optimal Ridge estimator on ICL regression, and they match or outperform Transformers on ICL classification and several relational tasks. Relational bottlenecks provide compute-efficient gains when their structure aligns with the task. The results extend the understanding of ICL beyond attention mechanisms, highlight the potential of all-MLP architectures for relational reasoning, and motivate further study of ICL in more complex, real-world, and data-limited settings.
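To make the regression setting concrete, here is a minimal NumPy sketch of one way such a task could be set up. It is an illustration under stated assumptions, not the authors' code. It assumes standard Gaussian inputs and weights, additive Gaussian label noise, and that data diversity $k$ means a finite pool of $k$ weight vectors to draw tasks from. The function names (`make_weight_pool`, `sample_task`, `flatten_context`, `ridge_predict`) and the noise level are hypothetical. With an unrestricted unit Gaussian prior on the weights, Ridge regression with regularization equal to the noise variance gives the posterior-mean prediction, which is the sense in which it serves as the reference here.

```python
import numpy as np

def make_weight_pool(rng, k, n):
    # Assumption: "data diversity k" is a finite pool of k task weight vectors.
    return rng.standard_normal((k, n))

def sample_task(rng, L, n, noise_std, pool=None):
    # Draw one task: L context pairs (x_i, y_i) plus one query pair.
    if pool is None:
        w = rng.standard_normal(n)          # unrestricted Gaussian prior
    else:
        w = pool[rng.integers(len(pool))]   # finite-diversity prior
    X = rng.standard_normal((L + 1, n))
    y = X @ w + noise_std * rng.standard_normal(L + 1)
    return X[:L], y[:L], X[L], y[L]

def flatten_context(X_ctx, y_ctx, x_query):
    # An MLP has no sequence axis, so the context is flattened into a single
    # vector of length L * (n + 1) + n (an assumed input encoding).
    return np.concatenate([np.column_stack([X_ctx, y_ctx]).ravel(), x_query])

def ridge_predict(X_ctx, y_ctx, x_query, noise_std):
    # Posterior-mean prediction for w ~ N(0, I) and noise variance noise_std**2.
    n = X_ctx.shape[1]
    A = X_ctx.T @ X_ctx + noise_std**2 * np.eye(n)
    w_hat = np.linalg.solve(A, X_ctx.T @ y_ctx)
    return x_query @ w_hat

rng = np.random.default_rng(0)
L, n, noise_std = 16, 4, 0.5
sq_errors = []
for _ in range(500):
    X_ctx, y_ctx, x_q, y_q = sample_task(rng, L, n, noise_std)
    sq_errors.append((ridge_predict(X_ctx, y_ctx, x_q, noise_std) - y_q) ** 2)
print("Ridge query MSE:", np.mean(sq_errors))
```

A trained network would take `flatten_context(...)` as input and its query error would be compared against the Ridge value printed above. Passing `pool=make_weight_pool(rng, k, n)` to `sample_task` restricts tasks to $k$ distinct weight vectors, in which case Ridge is no longer the optimal predictor for that finite prior.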
Abstract
In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context comparably with Transformers under the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging prior arguments against MLPs' ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs in a synthetic setting, and support the growing interest in all-MLP alternatives to Transformer architectures. It remains unclear how MLPs perform against Transformers at scale on real-world tasks, and where a performance gap may originate. We encourage further exploration of these architectures in more complex settings to better understand the potential comparative advantage of attention-based schemes.
