In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks

Ayush Goel; Arjun Kohli; Sarvagya Somvanshi

TL;DR

This paper empirically studies the similarities and limitations of linear attention relative to quadratic attention in in-context learning (ICL) on the canonical linear-regression task of Garg et al.

Abstract

Recent work has demonstrated that transformers and linear attention models can perform in-context learning (ICL) on simple function classes, such as linear regression. In this paper, we empirically study how these two attention mechanisms differ in their ICL behavior on the canonical linear-regression task of Garg et al. We evaluate learning quality (MSE), convergence, and generalization behavior of each architecture. We also analyze how increasing model depth affects ICL performance. Our results illustrate both the similarities and limitations of linear attention relative to quadratic attention in this setting.
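To make the evaluation setting concrete, the following is a minimal sketch of the in-context linear-regression task and the per-position squared error it is scored with. It assumes the standard isotropic setup commonly attributed to Garg et al. (inputs and task weights drawn from a standard Gaussian, noise-free targets); the function names `sample_prompt` and `least_squares_icl_errors` are hypothetical, and the least-squares predictor is only a reference baseline, not either of the attention models studied in the paper.

```python
import numpy as np


def sample_prompt(n_points, dim, rng):
    """Sample one in-context regression prompt (x_1, y_1, ..., x_n, y_n).

    Assumption: x_i ~ N(0, I_d), w ~ N(0, I_d), y_i = w . x_i (no noise).
    """
    w = rng.standard_normal(dim)               # hidden task vector
    xs = rng.standard_normal((n_points, dim))  # in-context inputs
    ys = xs @ w                                # noise-free targets
    return xs, ys, w


def least_squares_icl_errors(xs, ys):
    """Squared error when predicting y_k from the k preceding (x, y) pairs.

    Uses the minimum-norm least-squares solution as a reference predictor.
    With zero context (k = 0) the prediction is 0.
    """
    errs = np.empty(len(xs))
    for k in range(len(xs)):
        if k == 0:
            pred = 0.0
        else:
            w_hat, *_ = np.linalg.lstsq(xs[:k], ys[:k], rcond=None)
            pred = xs[k] @ w_hat
        errs[k] = (pred - ys[k]) ** 2
    return errs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs, ys, _ = sample_prompt(n_points=12, dim=5, rng=rng)
    # Error per context length; it collapses once the context holds
    # at least `dim` examples, because the system becomes determined.
    print(least_squares_icl_errors(xs, ys))
```

Averaging these per-position errors over many sampled prompts gives an MSE-versus-context-length curve, which is the kind of quantity a trained model's in-context predictions would be compared against.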

Paper Structure (25 sections, 6 equations, 3 figures, 3 tables)

Introduction and Background
Our Contribution
Methods
Task and Data Generation
Model Architectures
Shared Hyperparameters
Quadratic Attention Transformer
Hyperparameters
Linear Attention Transformer
Kernelized Attention and Feature Map Selection
Recurrent Computation
Hyperparameters
Training and Evaluation
Hypotheses
Results
...and 10 more sections

Figures (3)

Figure 1: Training and Testing loss. Top row shows shallower networks (1 and 3 layers), while the bottom row shows the deeper 6-layer model.
Figure 2: Training and Testing loss for the Linear Attention Transformer
Figure 3: Isotropic vs. anisotropic performance across depths.
