
What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

Xingwu Chen, Difan Zou

TL;DR

This work probes how the depth of an attention-only transformer affects memorization, reasoning, generalization, and contextual generalization on four designed sequence tasks. It develops a theory that single-layer transformers can memorize but cannot perform the more complex tasks, while two-layer transformers enable reasoning and generalization and three-layer models enable contextual generalization, with deeper models offering faster learning in harder settings. The authors introduce a parsing-copying-matching mechanism and prove constructive existence results for specific layer-depth configurations, supported by synthetic experiments and attention-map analyses. The findings illuminate the architectural requirements for emergent abilities in transformers and offer practical guidance for efficient model design in sequence-based tasks. The work also motivates future exploration of nested in-context tasks and broader data regimes to generalize these depth-threshold insights.

Abstract

We study the capabilities of the transformer architecture with varying depth. Specifically, we design a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization. We show that a transformer with only one attention layer can excel at memorization but falls short on the other tasks. We then show that exhibiting reasoning and generalization ability requires the transformer to have at least two attention layers, while contextual generalization ability may necessitate three attention layers. Additionally, we identify a class of simple operations that a single attention layer can execute, and show that the complex tasks can be approached as combinations of these simple operations and can thus be resolved by stacking multiple attention layers. This sheds light on studying more practical and complex tasks beyond our design. Numerical experiments corroborate our theoretical findings.
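To make the idea of a "simple operation" executed by one attention layer concrete, here is a minimal NumPy sketch of a single causal attention head that copies the previous token. It is an illustration under stated assumptions, not the paper's construction: the one-hot positional encodings, the hand-set shift matrix, the sharpness parameter `beta`, and the function name `copy_previous_token` are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def copy_previous_token(tokens, beta=50.0):
    """One hand-built causal attention head (hypothetical weights).

    tokens: (n, d) array of token embeddings.
    Returns (out, attn), where out[i] is approximately tokens[i-1]
    for i >= 1 and out[0] equals tokens[0].
    """
    n = tokens.shape[0]
    pos = np.eye(n)              # assumption: one-hot positional encodings
    shift = np.eye(n, k=-1)      # shift[i, i-1] = 1: position i looks at i-1
    shift[0, 0] = 1.0            # position 0 has no predecessor, so it looks at itself
    scores = beta * (pos @ shift @ pos.T)          # (n, n) attention logits
    causal = np.tril(np.ones((n, n), dtype=bool))  # no attention to future tokens
    scores = np.where(causal, scores, -np.inf)
    attn = softmax(scores, axis=-1)
    return attn @ tokens, attn

tokens = np.arange(12.0).reshape(4, 3)
out, attn = copy_previous_token(tokens)
```

With a large `beta` the softmax is nearly one-hot, so each output row is close to the embedding of the preceding token. Stacking a second layer on top of such a head is one way a model could then compare the copied token against other positions.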

Paper Structure (39 sections, 17 theorems, 68 equations, 8 figures, 1 table)

This paper contains 39 sections, 17 theorems, 68 equations, 8 figures, 1 table.

Key Result

Theorem 5.1

For any dataset of the sequence classification task, denoted by $D_{\texttt{SC}}$, let $d$ be the token dimension, and $n$ be the length of the sequence (i.e., number of tokens). Then there exists a transformer $\texttt{TF}$ with $L = 1$ attention layer, $n$ attention heads, and model embedding dime

Figures (8)

  • Figure 1: Descriptions of the four sequence learning tasks considered in this work, including (1) sequence classification task; (2) in-context question answering task; (3) template matching task; and (4) in-context template matching task. Here each input, context, and query are represented as sequences consisting of multiple tokens.
  • Figure 2: Performance of transformers with different numbers of layers on memorization, reasoning, generalization, and contextual generalization tasks. Far left column: A single-layer transformer can memorize sequences with distinct labels. Center left column: A single-layer transformer struggles with reasoning tasks, while a two-layer transformer can learn reasoning with enough training steps. Center right column: A single-layer transformer struggles to generalize on template tasks, while a two-layer transformer quickly learns to generalize. Far right column: On the more complex contextual generalization tasks, one- and two-layer transformers fail, but a three-layer transformer performs well.
  • Figure 3: Attention maps for a trained two-layer transformer in the reasoning sequences "A=1B=2A=" (top row) and "A=1B=2A=" (bottom row).
  • Figure 4: Attention maps for a trained two-layer transformer in the template sequences "AAB=" (top row) and "ABB=" (bottom row).
  • Figure 5: Training dynamics for different attention heads on the memorization task.
  • ...and 3 more figures
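
The example strings in the captions above ("A=1B=2A=", "AAB=", "ABB=") suggest what the task inputs look like. The following sketch generates inputs of that shape. It is an assumption-laden illustration rather than the authors' data pipeline: the alphabets, the number of key-value pairs, and the helper names `make_icqa_example` and `template_of` are hypothetical, and the template rule (sequences with the same repetition pattern share a template) is an assumed reading of the template matching task.

```python
import random

def make_icqa_example(rng, keys="ABCD", values="1234", n_pairs=2):
    """Build an in-context question answering prompt such as 'A=1B=2A='.

    Returns (prompt, answer): the context lists key=value pairs, the query
    repeats one key, and the answer is the value bound to that key.
    """
    ks = rng.sample(list(keys), n_pairs)        # distinct keys
    vs = [rng.choice(values) for _ in ks]       # values may repeat
    context = "".join(f"{k}={v}" for k, v in zip(ks, vs))
    q = rng.randrange(n_pairs)                  # which key to query
    return context + f"{ks[q]}=", vs[q]

def template_of(seq):
    """Map a sequence to its repetition pattern, e.g. 'XXY' -> 'AAB'."""
    mapping = {}
    out = []
    for ch in seq:
        if ch not in mapping:
            # First unseen symbol becomes 'A', the next 'B', and so on.
            mapping[ch] = chr(ord("A") + len(mapping))
        out.append(mapping[ch])
    return "".join(out)

rng = random.Random(0)
prompt, answer = make_icqa_example(rng)
```

Under this reading, a model that generalizes on the template task must treat "XXY" and "QQR" alike even if it never saw those particular tokens together, which is why plain memorization is not enough.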

Theorems & Definitions (32)

  • Definition 4.1
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • Theorem 5.5
  • Theorem 5.6
  • Lemma C.1: Instructive attention
  • proof
  • Lemma C.2: Constrained attention
  • ...and 22 more