Algorithmic Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

Orit Davidovich, Zohar Ringel

Abstract

We formally define Algorithmic Capture (i.e., ``grokking'' an algorithm) as the ability of a neural network to generalize to arbitrary problem sizes ($T$) with controllable error and minimal sample adaptation, distinguishing true algorithmic learning from statistical interpolation. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the functions these networks can learn. We show that despite their universal expressivity, transformers possess an inductive bias towards low-complexity algorithms within the Efficient Polynomial Time Heuristic Scheme (EPTHS) class. This bias effectively prevents them from capturing higher-complexity algorithms, while allowing success on simpler tasks like search, copy, and sort.

Paper Structure (25 sections, 1 theorem, 71 equations, 1 figure, 1 table)

Key Result

Lemma 5.1

For $A_c = \mathop{\mathrm{Softmax}}\nolimits(\bm{S})_c$, the sums of the absolute values of its first and second derivatives with respect to the logits $\bm{S}$ are bounded by universal constants, independently of $T$:
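A quick numerical check of this statement can be sketched as follows. The sketch uses the standard softmax derivatives, $\partial A_c/\partial S_i = A_c(\delta_{ci} - A_i)$ and $\partial^2 A_c/\partial S_i \partial S_j = A_c[(\delta_{ci} - A_i)(\delta_{cj} - A_j) - A_i(\delta_{ij} - A_j)]$. The constants $1/2$ and $2$ checked below come from a crude triangle-inequality estimate made for this illustration and are not necessarily the constants stated in the lemma; the function names and the random logits are likewise assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    z = sum(exps)
    return [e / z for e in exps]


def abs_derivative_sums(logits, c):
    """Return (sum_i |dA_c/dS_i|, sum_{i,j} |d^2 A_c/dS_i dS_j|)."""
    a = softmax(logits)
    n = len(a)
    grad_sum = 0.0
    hess_sum = 0.0
    for i in range(n):
        d_ci = 1.0 if i == c else 0.0
        grad_sum += abs(a[c] * (d_ci - a[i]))
        for j in range(n):
            d_cj = 1.0 if j == c else 0.0
            d_ij = 1.0 if i == j else 0.0
            h = a[c] * ((d_ci - a[i]) * (d_cj - a[j]) - a[i] * (d_ij - a[j]))
            hess_sum += abs(h)
    return grad_sum, hess_sum


random.seed(0)
results = {}
for T in (4, 16, 64, 256):
    # Hypothetical logits; any real-valued vector of length T would do.
    logits = [random.gauss(0.0, 3.0) for _ in range(T)]
    c = logits.index(max(logits))  # look at the largest attention weight
    a_c = softmax(logits)[c]
    results[T] = (a_c, *abs_derivative_sums(logits, c))

for T, (a_c, g, h) in results.items():
    print(f"T={T:4d}  A_c={a_c:.3f}  grad_sum={g:.4f}  hess_sum={h:.4f}")
```

In this sketch the first-derivative sum equals $2A_c(1 - A_c) \le 1/2$ exactly, and the second-derivative sum stays below $2$ for every $T$ tried, which is consistent with a bound that does not grow with sequence length.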

Figures (1)

  • Figure 1: Empirical Verification of Algorithmic Capture. We train models on problem instances of size $T_0$, reaching accuracy $\delta$ after seeing $P_0(\delta)$ data points. We then fine-tune these models on larger instance sizes ($T>T_0$), re-achieving $\delta$-accuracy after seeing $P$ extra data points. Dots are average empirical values based on $20$-$40$ transformer networks, and solid lines are best fits to $C \log(T/T_0)$, meant to guide the eye. For induction and sorting, $P$ appears to be bounded by logarithmic growth, suggesting algorithmic capture. However, for Shortest Path and Minimal Cut, both very deep ($40$ layers) and standard ($4$ layers) architectures exhibit superlinear growth. For further experimental details see App. \ref{App:ExpDetails}.
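
The one-parameter fit $P \approx C \log(T/T_0)$ mentioned in the caption can be sketched with ordinary least squares through the origin, which gives $C = \sum_k x_k P_k / \sum_k x_k^2$ with $x_k = \log(T_k/T_0)$. The fitting procedure actually used for the figure is not specified here, so this is only one plausible choice. The function name and all numbers below are hypothetical: the "measurements" are generated from a known $C$ purely to show that the fit recovers it, and are not data from the paper.

```python
import math


def fit_log_coefficient(sizes, extra_samples, t0):
    """Least-squares C for the model P = C * log(T / t0), no intercept."""
    xs = [math.log(t / t0) for t in sizes]
    numerator = sum(x * p for x, p in zip(xs, extra_samples))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


# Synthetic placeholder values, not experimental results.
T0 = 16
sizes = [32, 64, 128, 256]
true_c = 50.0
extra_samples = [true_c * math.log(t / T0) for t in sizes]

c_hat = fit_log_coefficient(sizes, extra_samples, T0)
print(f"fitted C = {c_hat:.3f}")
```

With real measurements, systematically positive residuals that grow with $T$ would point to the faster-than-logarithmic growth described for Shortest Path and Minimal Cut.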

Theorems & Definitions (2)

  • Lemma 5.1: Absolute Hessian Sum of Softmax
  • proof