MLPs Learn In-Context on Regression and Classification Tasks

William L. Tong, Cengiz Pehlevan

TL;DR

The paper investigates whether in-context learning (ICL) can be achieved by non-attention-based models, showing that vanilla MLPs and MLP-Mixer architectures can learn ICL under the same compute budget as Transformers across synthetic regression, classification, and relational tasks. Using controlled, synthetic tasks with varying context length $L$, data diversity $k$, and input dimension $n$, the study compares MLPs, MLP-Mixers, Transformers, and relational bottleneck variants. Key findings include that MLPs can reach near Bayes-optimal Ridge performance in ICL regression and compete with or outperform Transformers in ICL classification and several relational tasks, with relational bottlenecks providing compute-efficient gains when task structure aligns. The results broaden the understanding of ICL beyond attention mechanisms, highlight the potential of all-MLP architectures for relational reasoning, and motivate further exploration of ICL in more complex, real-world settings and data-limited regimes.

Abstract

In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context comparably with Transformers under the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging prior arguments against MLPs' ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs in a synthetic setting, and support the growing interest in all-MLP alternatives to Transformer architectures. It remains unclear how MLPs perform against Transformers at scale on real-world tasks, and where a performance gap may originate. We encourage further exploration of these architectures in more complex settings to better understand the potential comparative advantage of attention-based schemes.

Paper Structure (50 sections, 3 theorems, 22 equations, 10 figures, 3 tables)


Key Result

Theorem A.1

Suppose the label function $f^*$ is non-constant. Then for all SGD steps $t$, there exists a template $\boldsymbol{z} \in \mathcal{W}^k$ and a string $\boldsymbol{x}$ consisting of symbols $x_1 x_2 \ldots x_k \in \mathcal{X}_{uns}^k$ which satisfy $\boldsymbol{z}$ such that where $c$ is a constant that depends only on $f^*$, and the expectation is taken over random initialization of parameters $\

Figures (10)

  • Figure 1: ICL regression and classification results. (a) ICL presents context exemplars from a novel task (red), followed by a query input (blue). The model must infer the solution (green) based on the context. (b) ICL regression example. The model receives linearly-related input points, and must regress the query point. (c) Compute vs. MSE on the unrestricted task distribution. Each point represents a single model, with particular parameters and training iterations. At large compute, MSE is approximately equal across all architectures. The red line corresponds to the Bayes optimal Ridge MSE. (d) Excess MSE (MSE above Bayes optimal) for varying context length $L$ on the unrestricted task distribution. Excess MSE remains flat for Mixers, but rises somewhat for Transformers. MLPs fail to learn in-context at all beyond $2^6$ context exemplars. The gray line corresponds to the excess MSE incurred by always guessing zero. (e, f) IWL to ICL transition with increasing data diversity. We train on a finite distribution with $k$ weights, then test on both the finite training distribution and the unrestricted distribution. All models exhibit a transition from IWL (represented by dMMSE) to ICL (represented by Ridge) as $k$ increases. Note: it is possible to "outperform" Bayes optimal Ridge on the finite training distribution by learning in-weight the underlying $\boldsymbol{\beta}$'s. (g) ICL classification example, with burstiness $B = 3$. Multiple clusters may share the same label. (h) Compute vs. cross entropy loss on ICL classification, with $k = 2048$ clusters, $B = 4$, and $L = 8$, which pushes all models to learn in-context. At large compute, all architectures attain near-zero cross entropy loss. The gray line corresponds to loss obtained from placing equal probability on the 2 (of $C = 32$) labels present in context. (i) Cross entropy loss for varying context length $L$ on the task configuration in (h). Loss is relatively flat for all architectures, though it increases a little for Mixers. (j) IWL to ICL transition with increasing data diversity, where $L = 8$ and $B = 4$. All models exhibit a transition from IWL to ICL as the number of clusters $k$ increases. (all) We use $n = 8$ dimension inputs. All line plots feature 95 percent confidence intervals about the mean, estimated from 5 replications.
  • Figure 2: Relational reasoning results. Global legend is at the bottom right. (a) Match-to-sample task. (b) Compute vs. cross entropy loss on MTS task. Each point represents a single model, with particular parameters and training time. RB MLPs attain the best loss with the smallest compute, followed by MLPs and Transformers. (c) OOD generalization on MTS. In-distribution radii are highlighted in red. MLPs and RB MLPs generalize well on OOD radii. No model generalizes well on OOD test scrambling. (d) Sphere oddball task. (e) Same as in (b), for sphere oddball. (f) OOD generalization on sphere oddball. In-distribution distance is highlighted in red. Red dashed lines correspond to the accuracy obtained by guessing that the furthest point away from the cluster center is the oddball. (g) Logit of oddball point as its distance from the center increases. Dashed lines correspond to different polynomial scalings. Only the Transformer fails to increase its logit with distance. (h) Line oddball task. (i) Compute vs. loss on line oddball task. RB MLP no longer learns the task well, but appending additional MLP layers ("RB MLP (deep)") helps. (j) OOD generalization on line oddball. In-distribution distance is highlighted in red. Red lines indicate accuracy attained by a model guessing that the furthest point away from the center is the oddball. MLPs continue to generalize better than Transformers, and match the deep RB MLP. (all) Shaded regions and error bars correspond to 95 percent confidence intervals estimated from 5 replications.
  • Figure 3: MLP accuracy on unseen symbols for the same-different task. The gray dashed line indicates chance-level performance. Shaded region indicates 95 percent confidence regions estimated from 5 replications. For higher data diversity (i.e. number of symbols in the task), the MLP generalizes progressively better. Beyond roughly $2^9$ symbols in the task, the MLP performs substantially above chance, and approaches perfect generalization beyond $2^{12}$ symbols.
  • Figure 4: Simple regression and classification results. (a) MLPs attain substantially lower MSE at lower compute than Transformers. The red line corresponds to the minimum attainable MSE. (b) Transformers attain performance given larger token sizes. (c, d) Same as in (a, b), for classification, with $k = 16$ clusters. (all) We use $n = 64$ dimension inputs. Other parameterizations are explored in Appendix \ref{app:mo_figures}. Shaded regions correspond to 95 percent confidence intervals estimated from 5 replications.
  • Figure 5: ICL regression with an autoregressive objective. For each input example $(\boldsymbol{x}_1, y_1, \boldsymbol{x}_2, y_2, \ldots, \boldsymbol{x}_L, y_L)$, we compute the autoregressive loss $\sum_i \mathcal{L} (f(\boldsymbol{x}_1, y_1, \boldsymbol{x}_2, y_2, \ldots \boldsymbol{x}_i), y_{i})$, for a neural network $f$ and MSE loss $\mathcal{L}$. For vanilla MLPs and Mixers, variable-length inputs are handled by padding inputs with zero up to the max length $L$. (a) Compute vs. MSE on the unrestricted task distribution. Each point represents a single model, with particular parameters and training iterations. Just as in the fixed input length case, at large compute, MSE is approximately equal across all architectures. The red line corresponds to the Bayes optimal Ridge MSE. (b) Excess MSE (MSE above Bayes optimal) for varying context length $L$ on the unrestricted task distribution. Excess MSE remains flat for Mixers and Transformers, but rises for MLPs. The grey line corresponds to the excess MSE incurred by the zero predictor. Given compute limitations, we plot on a slightly narrower range of context lengths, but the overall trends remain consistent with the finite-input-length case. (c, d) IWL to ICL transition with increasing data diversity. We train on a finite distribution with $k$ weights, then test on both the finite training distribution and the unrestricted distribution. Just as with finite input lengths, all models exhibit a transition from IWL (represented by dMMSE) to ICL (represented by Ridge) as $k$ increases. Note: it is possible to "outperform" Bayes optimal Ridge on the finite training distribution by learning in-weight the underlying $\boldsymbol{\beta}$'s. (all) We use $n = 8$ dimension inputs. All line plots feature 95 percent confidence intervals about the mean, estimated from 5 replications.
  • ...and 5 more figures
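
The autoregressive objective described in the Figure 5 caption, with zero padding for fixed-width models such as MLPs and Mixers, can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' code. The sampling distributions for the inputs and weights, the function names (`sample_episode`, `padded_prefixes`, `autoregressive_loss`, `ridge_predict`), and the ridge penalty are assumptions made for the example; the penalty `n * noise_std**2` is the posterior-mean choice only under the Gaussian prior assumed here.

```python
import numpy as np


def sample_episode(L, n, noise_std, rng):
    """One ICL regression episode: L (x, y) pairs sharing a hidden weight vector."""
    beta = rng.normal(size=n) / np.sqrt(n)  # assumption: beta ~ N(0, I / n)
    xs = rng.normal(size=(L, n))            # assumption: x ~ N(0, I)
    ys = xs @ beta + noise_std * rng.normal(size=L)
    return xs, ys


def padded_prefixes(xs, ys):
    """Build one fixed-width input per position i = 1..L.

    Row i-1 holds the flattened prefix (x_1, y_1, ..., x_i), followed by
    zeros up to the width of the longest prefix. Its target is y_i.
    """
    L, n = xs.shape
    flat = np.concatenate([xs, ys[:, None]], axis=1).reshape(-1)
    width = L * (n + 1) - 1                 # longest prefix stops before y_L
    inputs = np.zeros((L, width))
    for i in range(1, L + 1):
        end = i * (n + 1) - 1               # index of y_i, which is excluded
        inputs[i - 1, :end] = flat[:end]
    return inputs, ys


def autoregressive_loss(predict, xs, ys):
    """Sum over i of the squared error between predict(prefix_i) and y_i."""
    inputs, targets = padded_prefixes(xs, ys)
    preds = np.array([predict(row) for row in inputs])
    return float(np.sum((preds - targets) ** 2))


def ridge_predict(ctx_x, ctx_y, query, lam):
    """Ridge regression on the context, evaluated at the query point (lam > 0)."""
    n = query.shape[0]
    w = np.linalg.solve(ctx_x.T @ ctx_x + lam * np.eye(n), ctx_x.T @ ctx_y)
    return float(query @ w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, n, noise_std = 8, 8, 0.5
    xs, ys = sample_episode(L, n, noise_std, rng)

    # Zero predictor, the baseline drawn as a gray line in the figures.
    zero_loss = autoregressive_loss(lambda row: 0.0, xs, ys)

    # Ridge reference, applied to each prefix of the same episode.
    lam = n * noise_std**2
    ridge_loss = sum(
        (ridge_predict(xs[:i], ys[:i], xs[i], lam) - ys[i]) ** 2 for i in range(L)
    )
    print(f"zero predictor: {zero_loss:.3f}, ridge: {ridge_loss:.3f}")
```

Any fixed-width model can be passed as `predict`, since every prefix is padded to the same length. The ridge reference is written against the unpadded context because it needs to know where the real exemplars end.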

Theorems & Definitions (5)

  • Theorem A.1: From Boix-Adsera et al., failure of MLPs at generalizing on unseen symbols
  • Proposition A.1: Permutation invariance of template satisfaction
  • proof
  • Proposition A.2: Conditions for generalizing to unseen inputs
  • proof