Table of Contents
Fetching ...

Algorithmic Capabilities of Random Transformers

Ziqian Zhong, Jacob Andreas

TL;DR

This work investigates what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented by the randomly initialized model.

Abstract

Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at https://github.com/fjzzq2002/random_transformers.

Algorithmic Capabilities of Random Transformers

TL;DR

This work investigates what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented by the randomly initialized model.

Abstract

Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at https://github.com/fjzzq2002/random_transformers.
Paper Structure (54 sections, 6 equations, 8 figures, 9 tables)

This paper contains 54 sections, 6 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview of problem setup. A: Modeling approach. We initialize transformers randomly, then optimize only their input and output embedding layers on a dataset of interest. We find that these random transformers can be successfully trained to perform a diverse set of human-meaningful tasks. B: Task set. We evaluate the effectiveness of random transformers on a set of model problems involving arithmetic and memorization, as well as modeling of natural language text.
  • Figure 2: Attention patterns observed in a 2-layer 1024-width random transformer trained on the needle-in-a-haystack task. The input sequence is a 1 b 2 c 3 d 4 b. The layer-1 head is used by values to attend to their markers, and the layer-2 head is used by the query to attend to its associated value.
  • Figure 3: Circular embedding observed in random transformer on modular addition. We plot the projection of the embedding matrix onto its first and third principal components, with tokens colored according to their numeric value.
  • Figure 4: Language modeling performances (measured in cross-entropy loss or equivalently log perplexity, lower is better) for fully trained and random transformers. Comparatively large hidden sizes are needed for random models to match the performance of fully trained models.
  • Figure 5: Sample completion generated by fully trained and random 512-width 2-layer transformers. While random models produce less coherent than fully trained models, they nonetheless generate text that is largely grammatical topically appropriate.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition F.1: Intermediate complexity of tasks (informal)