Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale

Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych

TL;DR

This work uses probabilistic context-free grammars (PCFGs) to generate synthetic corpora that serve as faithful, computationally efficient proxies for web-scale text corpora, and it provides theoretical underpinnings for the role that hierarchy plays in the training dynamics of language models.

Abstract

Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. The intractable scale of pretraining corpora limits a bottom-up investigation in this direction, while simplistic assumptions about the data generation process limit expressivity and fail to explain complex patterns. In this work, we use probabilistic context-free grammars (PCFGs) to generate synthetic corpora that are faithful and computationally efficient proxies for web-scale text corpora. We investigate the emergence of three mechanistic phenomena (induction heads, function vectors, and the Hydra effect) under our designed data generation process, as well as in the checkpoints of real-world language models. Our findings suggest that hierarchical structures in the data generation process are the key factor in explaining the emergence of these phenomena. We provide the theoretical underpinnings of the role played by hierarchy in the training dynamics of language models. In summary, our work is the first of its kind to provide a unified explanation for the emergence of seemingly unrelated mechanistic phenomena in LLMs, augmented with efficient synthetic tooling for future interpretability research.
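
To make the idea of a PCFG-generated corpus concrete, here is a minimal sketch of sampling sentences from a probabilistic context-free grammar. The grammar `TOY_PCFG`, the helper names `sample` and `generate_corpus`, and the depth limit are all hypothetical and chosen only for illustration; they are not the grammar or tooling used in the paper.

```python
import random

# Hypothetical toy grammar (an assumption, not the paper's grammar):
# each nonterminal maps to a list of (right-hand side, probability) pairs.
TOY_PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("det", "noun"), 0.7), (("det", "adj", "noun"), 0.3)],
    "VP": [(("verb", "NP"), 0.6), (("verb",), 0.4)],
}


def sample(symbol, grammar, rng, depth=0, max_depth=20):
    """Recursively expand `symbol`; symbols without rules are terminals."""
    if symbol not in grammar:
        return [symbol]
    if depth >= max_depth:
        raise RecursionError("expansion exceeded max_depth")
    rules = grammar[symbol]
    options = [rhs for rhs, _ in rules]
    weights = [prob for _, prob in rules]
    # Pick one production according to its probability.
    rhs = rng.choices(options, weights=weights, k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample(child, grammar, rng, depth + 1, max_depth))
    return tokens


def generate_corpus(grammar, n_sentences, seed=0, start="S"):
    """Sample `n_sentences` token sequences from `grammar`, seeded for reproducibility."""
    rng = random.Random(seed)
    return [sample(start, grammar, rng) for _ in range(n_sentences)]


corpus = generate_corpus(TOY_PCFG, 5)
for sentence in corpus:
    print(" ".join(sentence))
```

Each sampled sentence is the leaf sequence of a derivation tree, so tokens that are far apart in the string can still depend on the same nonterminal higher in the tree. That is the kind of hierarchical latent structure the abstract refers to.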

Paper Structure (27 sections, 4 theorems, 19 equations, 7 figures, 3 tables, 3 algorithms)

Key Result

Theorem 1

If a latent variable $Z$ influences multiple distant positions, then any finite-capacity model minimizing expected autoregressive loss must reuse earlier latent inferences to predict later tokens. It needs to implement a distance-invariant retrieval of past latent information and a similarity metric o…

Figures (7)

  • Figure 1: Overview of the experimental setup.
  • Figure 2: $k$-order induction heads across training. Accuracy of attention values towards the induction-relevant token.
  • Figure 3: Layer-wise function vector improvement across training for the PCFG. Improvement is measured by injecting the attention of a contextualized representation into a zero-shot setting.
  • Figure 4: Layer-wise Hydra effect development across training. Drop in ground-truth logits at layer $\ell$ caused by ablating its precursor layer $\ell-m$ across training steps. A positive value (blue) denotes a Hydra effect at layer $\ell$.
  • Figure 5: Hierarchy is internalized in stages during training. (a) Probability mass towards the next exclusively valid tokens saturates early in training, corresponding to shallower hierarchy learning. (b) Layer-wise structural probe accuracy improves rapidly after substantial training (deep hierarchy learning).
  • ...and 2 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Corollary 1.1
  • Lemma 2
  • Theorem 3