Interpreting Affine Recurrence Learning in GPT-style Transformers

Samarth Bhargav, Alexander Gu

TL;DR

This investigation into the mechanistic interpretability of GPT-style transformers focuses on their ability to learn and predict affine recurrences as an in-context learning (ICL) task. It finds that the model forms an initial estimate of the target sequence using a copying mechanism in the zeroth layer, then refines that estimate through negative similarity heads in the second layer.

Abstract

Understanding the internal mechanisms of GPT-style transformers, particularly their capacity to perform in-context learning (ICL), is critical for advancing AI alignment and interpretability. In-context learning allows transformers to generalize during inference without modifying their weights, yet the precise operations driving this capability remain largely opaque. This paper investigates the mechanistic interpretability of these transformers, focusing on their ability to learn and predict affine recurrences as an ICL task. To study this, we trained a custom three-layer transformer to predict affine recurrences and analyzed the model's internal operations using both empirical and theoretical approaches. Our findings reveal that the model forms an initial estimate of the target sequence using a copying mechanism in the zeroth layer, which is subsequently refined through negative similarity heads in the second layer. These insights contribute to a deeper understanding of transformer behavior on recurrence tasks and offer potential avenues for improving AI alignment through mechanistic interpretability. Finally, we discuss the implications of our results for future work, including extensions to higher-dimensional recurrences and the exploration of polynomial sequences.
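To make the task concrete, the sketch below generates a scalar affine recurrence and predicts its next term from context alone. It assumes the one-dimensional form x_{k+1} = a * x_k + b with integer terms; the abstract does not state the exact parameterization, and the function names here are hypothetical. The closed-form solver is only an illustration of what the in-context task demands, not the mechanism the paper attributes to the transformer.

```python
from fractions import Fraction


def affine_sequence(a, b, x0, length):
    """Return [x_0, ..., x_{length-1}] where x_{k+1} = a * x_k + b (assumed form)."""
    seq = [x0]
    for _ in range(length - 1):
        seq.append(a * seq[-1] + b)
    return seq


def infer_coefficients(x0, x1, x2):
    """Recover (a, b) from three consecutive integer terms.

    Subtracting x1 = a*x0 + b from x2 = a*x1 + b gives
    a = (x2 - x1) / (x1 - x0), which requires x1 != x0.
    """
    a = Fraction(x2 - x1, x1 - x0)
    b = x1 - a * x0
    return a, b


def predict_next(context):
    """Predict the term after `context` using only the terms in context."""
    a, b = infer_coefficients(*context[:3])
    return a * context[-1] + b


seq = affine_sequence(a=2, b=3, x0=1, length=6)
prediction = predict_next(seq[:5])  # compare against the held-out seq[5]
```

Exact rational arithmetic via `Fraction` avoids floating-point error when the inferred slope is not an integer. A model solving the task in context must arrive at an equivalent prediction without being told a or b.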

Paper Structure

This paper contains 22 sections, 5 equations, 18 figures.

Figures (18)

  • Figure 1: Types of Affine Recurrences.
  • Figure 2: Loss as a Function of Steps.
  • Figure 3: Layer 0 Attention Patterns.
  • Figure 4: Layer 1 Attention Patterns.
  • Figure 5: Layer 2 Attention Patterns (Alternating Sequence).
  • ...and 13 more figures