Table of Contents
Fetching ...

Knee-Deep in C-RASP: A Transformer Depth Hierarchy

Andy Yang, Michaël Cadilhac, David Chiang

TL;DR

The paper provides a formal theory linking Transformer depth to expressivity by proving an exact equivalence between fixed-precision Transformers (with rounding inside attention) and TL With Counting / C-RASP, preserving depth. It establishes a strict depth hierarchy for TL With Counting (and related logics), implying a corresponding depth separation for fixed-precision transformers and commonly used position encodings like RoPE and ALiBi. The authors introduce Cropping and Reduction lemmas, along with affix restrictions, to show that deeper constructs can express languages that shallower ones cannot, and they validate the theory with next-token prediction experiments on a family of ${L_k}$ languages. This work provides a rigorous foundation for understanding depth-related capabilities in Transformers and offers insights into length generalization and depth-efficient architectural design.

Abstract

It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). The same is also proven for transformers with positional encodings (like RoPE and ALiBi). These results are established by studying a temporal logic with counting operators equivalent to C-RASP. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.

Knee-Deep in C-RASP: A Transformer Depth Hierarchy

TL;DR

The paper provides a formal theory linking Transformer depth to expressivity by proving an exact equivalence between fixed-precision Transformers (with rounding inside attention) and TL With Counting / C-RASP, preserving depth. It establishes a strict depth hierarchy for TL With Counting (and related logics), implying a corresponding depth separation for fixed-precision transformers and commonly used position encodings like RoPE and ALiBi. The authors introduce Cropping and Reduction lemmas, along with affix restrictions, to show that deeper constructs can express languages that shallower ones cannot, and they validate the theory with next-token prediction experiments on a family of languages. This work provides a rigorous foundation for understanding depth-related capabilities in Transformers and offers insights into length generalization and depth-efficient architectural design.

Abstract

It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). The same is also proven for transformers with positional encodings (like RoPE and ALiBi). These results are established by studying a temporal logic with counting operators equivalent to C-RASP. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.

Paper Structure

This paper contains 41 sections, 39 theorems, 51 equations, 5 figures.

Key Result

Lemma 2.8

For all $k \ge 0$, define $L_k$ to be the language of strings with alternating blocks of $a$'s and $b$'s, starting with $a$: Then $L_k$ is $k$-piecewise testable.

Figures (5)

  • Figure 1: Theoretical results. C-RASP is equivalent to fixed-precision transformers, and a strict depth hierarchy for C-RASP (deeper programs solve more problems) implies a strict depth hierarchy for fixed-precision transformers (deeper networks solve more problems).
  • Figure 2: Our theoretical results predict that a transformer with depth $k$ can recognize language ${L_{{\textcolor{black}{k+2}}}}$ but not ${L_{{\textcolor{black}{k+3}}}}$ (demarcated by the black line), and this closely predicts our experimental results (shown as numbers and colors).
  • Figure 3: Strings can be pictured as paths, and minimal depth-1 subformulas as half-planes.
  • Figure 4: Within a sufficiently large rectangle ($I$), one can find a sub-rectangle ($I'$) within which the depth-1 subformulas are either always true or always false.
  • Figure 6: Experimental results. \ref{['cor:prediction_task_depth']} predicts that a transformer with depth $k$ can recognize language ${L_{{\textcolor{black}{k+2}}}}$ but not ${L_{{\textcolor{black}{k+3}}}}$ (demarcated by the black line). Up to at least ${L_{{\textcolor{black}{12}}}}$, this closely predicts our experimental results (shown as numbers and colors).

Theorems & Definitions (60)

  • Definition 2.1
  • Example 2.2
  • Definition 2.3: parikh1966context
  • Definition 2.4
  • Definition 2.5
  • Definition 2.6
  • Definition 2.7
  • Lemma 2.8
  • Lemma 2.8
  • Theorem 3.1
  • ...and 50 more