Algorithmic Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

Orit Davidovich; Zohar Ringel

Abstract

We formally define Algorithmic Capture (i.e., ``grokking'' an algorithm) as the ability of a neural network to generalize to arbitrary problem sizes ($T$) with controllable error and minimal sample adaptation, distinguishing true algorithmic learning from statistical interpolation. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the functions these networks can learn. We show that despite their universal expressivity, transformers possess an inductive bias towards low-complexity algorithms within the Efficient Polynomial Time Heuristic Scheme (EPTHS) class. This bias effectively prevents them from capturing higher-complexity algorithms, while allowing success on simpler tasks like search, copy, and sort.

Paper Structure (25 sections, 1 theorem, 71 equations, 1 figure, 1 table)

This paper contains 25 sections, 1 theorem, 71 equations, 1 figure, 1 table.

Introduction
Related Work
Setting and Definitions
Low Complexity Bias of Lazy Learning
The Infinite-width Limit
The Fully-connected Network
Kernel Propagation
Complexity and Accuracy of Kernel Evaluation
Tighter Bounds and Feature Learning
Perturbation Theory Estimate
Experiments
Discussion
Transformers in Detail
Setup
Inference-time Complexity
...and 10 more sections

Key Result

Lemma 5.1

For $A_c = \mathop{\mathrm{Softmax}}\nolimits(\bm{S})_c$, the sums of the absolute values of its first and second derivatives with respect to the logits $\bm{S}$ are each bounded by a constant independent of $T$.
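The explicit bound from the paper is not reproduced in this summary. As a rough numerical illustration of the lemma's claim, the sketch below evaluates the softmax derivatives in closed form and sums their absolute values for several sequence lengths $T$. The function names, the random logits, and the constants checked (1/2 for the first derivatives, a loose 3 for the second derivatives) are assumptions for this example, obtained from the standard identity $\partial A_c / \partial S_i = A_c(\delta_{ci} - A_i)$, and are not necessarily the constants stated in the paper.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over a 1-D logit vector.
    z = np.exp(s - s.max())
    return z / z.sum()

def abs_derivative_sums(s, c):
    """Sum of |dA_c/dS_i| over i and of |d^2 A_c/dS_i dS_j| over (i, j)."""
    a = softmax(s)
    e_c = np.zeros_like(a)
    e_c[c] = 1.0
    d = e_c - a                                   # delta_{ci} - A_i
    grad = a[c] * d                               # first derivatives of A_c
    # Second derivatives: A_c [ d_i d_j - A_i (delta_{ij} - A_j) ]
    hess = a[c] * (np.outer(d, d) - (np.diag(a) - np.outer(a, a)))
    return np.abs(grad).sum(), np.abs(hess).sum()

# Hypothetical check: random logits at increasing sequence length T.
rng = np.random.default_rng(0)
for T in (4, 64, 512):
    s = rng.normal(scale=3.0, size=T)
    g, h = abs_derivative_sums(s, c=0)
    print(f"T={T:4d}  sum|grad|={g:.4f}  sum|hess|={h:.4f}")
```

In this sketch the first sum equals $2A_c(1-A_c)$, so it can never exceed 1/2 regardless of $T$; the second sum stays below a fixed constant for the same reason, since every term carries a factor of a probability that sums to one.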

Figures (1)

Figure 1: Empirical Verification of Algorithmic Capture. We train models on problem instances of size $T_0$, reaching accuracy $\delta$ after seeing $P_0(\delta)$ data points. We then fine-tune these models on larger instance sizes ($T>T_0$), re-achieving $\delta$-accuracy after seeing $P$ extra data points. Dots are average empirical values based on $20$-$40$ transformer networks, and solid lines are best fits to $C \log(T/T_0)$ meant to guide the eye. For induction and sorting, $P$ appears bounded by logarithmic growth, suggesting algorithmic capture. However, for Shortest Path and Minimal Cut, both very deep ($40$ layers) and standard ($4$ layers) architectures exhibit superlinear growth. For further experimental details see App. \ref{App:ExpDetails}.
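The guide-to-the-eye fit $P \approx C \log(T/T_0)$ mentioned in the caption has a single free parameter, so it can be obtained by least squares through the origin. The following is a minimal sketch of such a fit. The function name `fit_log_growth` and the sample numbers are hypothetical and are not taken from the paper's data or code.

```python
import numpy as np

def fit_log_growth(T, P, T0):
    """Least-squares estimate of C in the model P = C * log(T / T0)."""
    x = np.log(np.asarray(T, dtype=float) / T0)
    y = np.asarray(P, dtype=float)
    # One-parameter regression through the origin: C = <x, y> / <x, x>.
    return float(x @ y / (x @ x))

# Hypothetical measurements: extra samples P needed at larger sizes T.
T0 = 16
T = [32, 64, 128, 256]
P = [210.0, 400.0, 640.0, 820.0]
C = fit_log_growth(T, P, T0)
print(f"fitted C = {C:.1f}")
```

A fit of this form describes the data well only when $P$ grows logarithmically. Under superlinear growth, as the caption reports for Shortest Path and Minimal Cut, the residuals would increase systematically with $T$.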

Theorems & Definitions (2)

Lemma 5.1: Absolute Hessian Sum of Softmax
proof
