Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability

Ivan Lee; Nan Jiang; Taylor Berg-Kirkpatrick

TL;DR

This paper asks whether attention is strictly required for in-context learning (ICL) by systematically evaluating 13 model architectures, trained from scratch, on a suite of synthetic ICL tasks. It finds that every architecture considered can perform ICL, and that transformers do not consistently outperform attention alternatives such as state-space and recurrent designs. The study documents differences in statistical efficiency, consistency, and tendency to memorize across architectures, and shows that prompt length and task difficulty significantly affect performance, challenging the notion that transformers are uniquely suited to ICL. It also examines the effect of the training data distribution (burstiness) and evaluates language modeling and a simple few-shot natural language task, offering guidance on when different architectures can learn from in-context examples without gradient updates.

Abstract

What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state-space-model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with or better in-context learners than transformers. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.

Paper Structure (16 sections, 7 equations, 11 figures, 17 tables)

Introduction
Synthetic In-context Learning Tasks
Model Architectures
Learning to learn (in-context)
The influence of training data distributional properties
Towards in-context learning in the real world
A simple few-shot natural language task
Experimental Details
Experimental details for linear regression, multiclass classification, and associative recall
Experimental details for language modeling
Supplementary data for Section \ref{sec:extrap}: associative recall, linear regression, multiclass classification
Noisy linear regression
Supplementary data for Section \ref{sec:omniglot}: image classification
Supplementary data for Section \ref{sec:language_modeling}: language modeling
Transformer Positional Embedding Ablations
...and 1 more section

Figures (11)

Figure 1: Evaluating various architectures on associative recall, linear regression, and multiclass classification. We plot test accuracy and mean squared error as a function of the number of in-context examples. A query index of $2^5=32$ implies $31$ in-context examples, which is also the highest number of in-context examples seen during training (vertical dotted line). Task difficulty increases from left to right. Each line represents the single run that achieved the best validation accuracy or mean squared error at query index $2^5$. See Tables \ref{tab:lr_table_best}, \ref{tab:ar_table_best}, \ref{tab:gmm_table_best} for a tabular view of the same data. See Figure \ref{fig:extrap_line_average} for average performance across training runs. See Appendix \ref{appendix:noisy_lr} for linear regression experiments with Gaussian noise, where we observe that trends are largely unchanged relative to the non-noisy setting. Classical baselines (black) are shown for linear regression (ridge regression) and multiclass classification (logistic regression).
Figure 2: Measuring the effects of training data distributional properties on in-context learning. We plot average (over training runs) test accuracy as a function of training steps. P(bursty) indicates the proportion of training prompts that were bursty (with the remainder non-bursty). See Table \ref{tab:og_table_average} for a tabular view of the same data. See Figure \ref{fig:og_line_best} for training runs that achieved max validation accuracy.
Figure 3: Evaluating architectures on language modeling. Left: Validation loss during training. Middle: ICL score as training progresses. Right: Validation loss as a function of context length.
Figure 4: Evaluating various architectures on a simple natural language ICL task. We report accuracy as a function of the number of in-context examples. We use the open-source weights for Llama2-7B and do not fine-tune. All other models are trained from scratch and are approximately 33M parameters (excluding embedding layers). Right: Flipped label setting, i.e., "happy" is replaced with "sad" and vice versa. See Figure \ref{fig:simple_icl_flipped} for normalized accuracy.
Figure 5: Evaluating various architectures on in-context learning associative recall, linear regression, and multiclass classification. We plot average test accuracy and mean squared error as a function of the number of in-context examples. A query index of $2^5=32$ implies $31$ in-context examples, which is also the highest number of in-context examples seen during training (vertical dotted line). Task difficulty increases from left to right. Each line represents an average over all training runs for a given combination of task, difficulty, and architecture. Classical baselines (black) are shown for linear regression (ridge regression) and multiclass classification (logistic regression). See Tables \ref{tab:lr_table_average}, \ref{tab:ar_table_average}, \ref{tab:gmm_table_average} for a tabular view of the same data. See Figure \ref{fig:extrap_line_best} for the training runs that achieved the best performance.
...and 6 more figures
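
The query-index convention used in the captions for Figures 1 and 5 (a query index of 32 means 31 in-context examples precede the query) and the ridge regression baseline for linear regression can be illustrated with a minimal sketch. The prompt dimension, the Gaussian sampling of inputs and weights, the regularization strength, and all function names here are assumptions chosen for illustration; they are not the paper's experimental configuration.

```python
import numpy as np


def num_in_context_examples(query_index):
    """A 1-based query index q means q - 1 labeled examples precede the query."""
    return query_index - 1


def make_prompt(num_points, dim, rng):
    """Sample one noiseless linear-regression prompt (hypothetical setup).

    Returns inputs xs of shape (num_points, dim) and targets ys = xs @ w
    for a weight vector w drawn fresh for this prompt.
    """
    w = rng.standard_normal(dim)
    xs = rng.standard_normal((num_points, dim))
    return xs, xs @ w


def ridge_predict(xs_ctx, ys_ctx, x_query, lam=1e-3):
    """Ridge regression baseline fit only on the in-context examples."""
    dim = xs_ctx.shape[1]
    gram = xs_ctx.T @ xs_ctx + lam * np.eye(dim)
    w_hat = np.linalg.solve(gram, xs_ctx.T @ ys_ctx)
    return x_query @ w_hat


def mse_at_query_index(query_index, dim=10, num_prompts=200, seed=0):
    """Mean squared error of the baseline on the point at the given query index."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(num_prompts):
        xs, ys = make_prompt(query_index, dim, rng)
        # The last point is the query; everything before it is context.
        pred = ridge_predict(xs[:-1], ys[:-1], xs[-1])
        errors.append((pred - ys[-1]) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    for q in (2, 4, 8, 16, 32):
        print(q, num_in_context_examples(q), mse_at_query_index(q))
```

Under these assumptions the baseline error falls sharply once the number of in-context examples exceeds the input dimension. Curves like those in Figures 1 and 5 compare trained models against this kind of classical reference point.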
