The Transformer Cookbook

Andy Yang, Christopher Watson, Anton Xue, Satwik Bhattamishra, Jose Llarena, William Merrill, Emile Dos Santos Ferreira, Anej Svete, David Chiang

TL;DR

The Transformer Cookbook collects and systematizes algorithmic constructions that can be encoded directly in transformer parameters, addressing fragmented prior work by offering a unified, recipe-like reference. It formalizes how to represent discrete data, implement arithmetic and logic with feed-forward layers, and manipulate information flow via various attention schemes, including uniform, hard, and masked varieties. By presenting concrete constructions—such as induction heads and Dyck-language recognizers—the paper demonstrates the practical reach of transformer-based computation and provides a toolkit for theory, interpretability, and architecture design. The work aims to bridge theory and practice, enabling rigorous analysis while guiding empirical exploration toward reliable and safe AI systems.

Abstract

We present the transformer cookbook: a collection of techniques for directly encoding algorithms into a transformer's parameters. This work addresses the steep learning curve of such endeavors, a problem exacerbated by a fragmented literature where key results are scattered across numerous papers. In particular, we synthesize this disparate body of findings into a curated set of recipes that demonstrate how to implement everything from basic arithmetic in feed-forward layers to complex data routing via self-attention. Our mise en place of formulations is for both newcomers seeking an accessible entry point and experts in need of a systematic reference. This unified presentation of transformer constructions provides a foundation for future work spanning theoretical research in computational complexity to empirical investigations in architecture design and interpretability.

Paper Structure

This paper contains 86 sections, 9 theorems, 116 equations, 1 figure, 4 tables.

Key Result

Lemma 5.3

Let $\mathbf{s} = (s_1, \ldots, s_n)$ be attention scores and let $j^* \in [n]$ and $\gamma > 0$ be such that for all $j \ne j^*$ we have $s_{j} < s_{j^*} - \gamma$. Then

Figures (1)

  • Figure 1: Prefix-sum checks for membership in the Dyck language. Each subpanel plots the running counts $O_k$ (opens) and $C_k$ (closes) at each position $k$ of a candidate string of length $N = 6$. (Left) Valid Dyck-1 string: $O_k \geq C_k$ for all $k < N$ and $O_N = C_N$, satisfying non-negativity and balanced counts, respectively. (Center) Prefix violation: Although $O_N = C_N$, at position $k = 3$, we have $O_3 < C_3$, violating non-negativity. (Right) Here $O_k \geq C_k$ for all $k$, but $O_N \neq C_N$, so the total counts are unbalanced.
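The two conditions in the caption (non-negativity of every prefix and balanced totals) can be checked directly with running counts. The following is a minimal sketch in plain Python, not the paper's transformer construction; the function names `prefix_counts` and `is_dyck1` and the three length-6 example strings are hypothetical, chosen only to mirror the three cases the caption describes.

```python
from itertools import accumulate


def prefix_counts(s, open_sym="(", close_sym=")"):
    """Return the running counts (O_k, C_k) of opens and closes for k = 1..N."""
    opens = list(accumulate(1 if ch == open_sym else 0 for ch in s))
    closes = list(accumulate(1 if ch == close_sym else 0 for ch in s))
    return opens, closes


def is_dyck1(s):
    """Check Dyck-1 membership via the two prefix-sum conditions."""
    if any(ch not in "()" for ch in s):
        return False
    opens, closes = prefix_counts(s)
    # Non-negativity: every prefix has at least as many opens as closes.
    nonneg = all(o >= c for o, c in zip(opens, closes))
    # Balance: total opens equal total closes (trivially true for the empty string).
    balanced = opens[-1] == closes[-1] if s else True
    return nonneg and balanced


# Hypothetical length-6 strings, one per case described in the caption.
valid = "(()())"        # both conditions hold
prefix_bad = "())(()"   # O_3 < C_3, although O_6 == C_6
unbalanced = "((()()"   # O_k >= C_k everywhere, but O_6 != C_6

results = {s: is_dyck1(s) for s in (valid, prefix_bad, unbalanced)}
```

Only the first string passes both checks. The second fails non-negativity at position 3 despite balanced totals, and the third keeps every prefix non-negative but ends with unequal counts.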

Theorems & Definitions (24)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3: Self-Attention
  • Definition 2.4: ReLU
  • Definition 2.5: Feed-Forward Network
  • Definition 4.1: Continuous Piecewise Linear Function
  • Definition 4.2: GELU
  • Definition 5.1: Hardmax
  • Definition 5.2
  • Lemma 5.3 (Edelman et al., 2022)
  • ...and 14 more