Modes of Sequence Models and Learning Coefficients

Zhongtian Chen, Daniel Murfet

TL;DR

This work introduces a geometric framework for sequence models by embedding conditional sequence distributions in a Hilbert space and extracting principal data patterns via tensor decompositions. It defines an effective true distribution $q^{(\chi)}$ through mode truncation of the fundamental tensor, thereby providing a principled coarse-graining of the data distribution. The authors prove that Local Learning Coefficient (LLC) estimates obtained via SGLD are insensitive to higher modes beyond a data-dependent threshold, meaning LLC reflects the geometry of the truncated, effective distribution rather than the full distribution. They also discuss how the inverse temperature in SGLD acts as a resolution dial on landscape structure, with practical implications for interpreting LLC measurements in transformer models and guiding future work on more expressive mode decompositions.

Abstract

We develop a geometric account of sequence modelling that links patterns in the data to measurable properties of the loss landscape in transformer networks. First, we cast conditional sequence distributions into a Hilbert-space framework and apply tensor decompositions to identify their principal modes. Truncating the small-amplitude modes yields an effective data distribution that preserves dominant structure while discarding statistical detail. Second, we show theoretically that Local Learning Coefficient (LLC) estimates are insensitive to modes below a data-dependent threshold. Consequently, the LLC calculated in practice characterises the geometry of the effective rather than the true distribution. This insight clarifies why reliable LLC estimates can be obtained even when a network parameter is not a strict minimiser of the population loss, and it highlights how the inverse temperature in SGLD acts as a resolution dial on the landscape structure.
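As a rough illustration of what truncating small-amplitude modes means, the sketch below applies a singular value decomposition to a toy matrix and keeps only the largest mode. The matrix `joint`, the helper `truncate_modes`, and the cutoff `chi` are hypothetical stand-ins chosen for this example. It is not the paper's construction of the fundamental tensor or of the effective distribution, and a truncated matrix of this kind is not guaranteed to be a valid probability table.

```python
import numpy as np


def truncate_modes(matrix, chi):
    """Return the rank-chi approximation of `matrix` and all its singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the chi largest modes: sum over alpha < chi of s_alpha * u_alpha v_alpha^T.
    return (u[:, :chi] * s[:chi]) @ vt[:chi, :], s


# Hypothetical toy data: rows index contexts x, columns index continuations y.
rng = np.random.default_rng(0)
dominant = np.outer([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])  # rank-one dominant structure
noise = 1e-3 * rng.standard_normal((3, 3))             # small-amplitude detail
joint = dominant + noise

approx, singular_values = truncate_modes(joint, chi=1)

# Frobenius-norm error of the truncation.
error = np.linalg.norm(joint - approx)
# Eckart-Young: this error equals the norm of the discarded singular values.
discarded = np.sqrt(np.sum(singular_values[1:] ** 2))
```

In this toy setting the truncation error is governed entirely by the discarded singular values, which is the sense in which a cutoff on mode amplitude gives a controlled coarse-graining.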

Paper Structure

This paper contains 41 sections, 14 theorems, 182 equations, 5 figures, 1 table.

Key Result

Lemma 2.1

The set $\{ u_\alpha \}_{\alpha \in \Lambda^+}$ is an orthonormal basis in $W$ for the image of $A$. We have [...]. We call $s_\alpha$ the singular values, $v_\alpha$ the right singular vectors, and $u_\alpha$ the left singular vectors.
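A finite-dimensional sanity check of this statement can be sketched with NumPy. The matrix `A` below is a hypothetical rank-two example, and the code assumes that $\Lambda^+$ indexes the strictly positive singular values. It checks that the corresponding left singular vectors are orthonormal, that they span the image of `A`, and that the standard SVD relation $A v_\alpha = s_\alpha u_\alpha$ holds.

```python
import numpy as np

# Hypothetical 4x3 matrix of rank 2 (third column = first + second).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 0.0, 2.0],
])

u, s, vt = np.linalg.svd(A, full_matrices=False)

# Assumed reading of Lambda^+: indices with strictly positive singular value.
positive = s > 1e-10 * s.max()
u_pos = u[:, positive]       # left singular vectors u_alpha
s_pos = s[positive]          # singular values s_alpha
v_pos = vt[positive, :].T    # right singular vectors v_alpha

# Orthonormality: the Gram matrix of the u_alpha is the identity.
gram = u_pos.T @ u_pos

# Spanning the image: projecting A onto span{u_alpha} leaves A unchanged.
projected = u_pos @ (u_pos.T @ A)

# Standard SVD relation: A v_alpha = s_alpha u_alpha for each alpha.
lhs = A @ v_pos
rhs = u_pos * s_pos
```

The number of retained vectors equals the rank of `A`, so the vectors with zero singular value play no role in describing the image.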

Figures (5)

  • Figure 1: Empirical modes. We show examples of $x,y$ pairs that are heavily loaded in the first empirical mode for $k = 1$, $l = 1$ in our experiments on the Pile. Shown are three text samples. Next to each $X, Y$ token we show its index in the tokeniser.
  • Figure 2: Empirical modes. We show examples of $x,y$ pairs that are heavily loaded in the first empirical mode for $k = 2$, $l = 1$ in our experiments on the Pile. Shown are three text samples. Next to each $X, Y$ token we show its index in the tokeniser.
  • Figure 3: Empirical modes. We show examples of $x,y$ pairs that are heavily loaded in the first empirical mode for $k = 2$, $l = 1$ in our experiments on the Pile. Shown are three text samples. Next to each $X, Y$ token sequence we show its indices in the tokeniser.
  • Figure 4: Empirical modes. We show examples of $x,y$ pairs that are heavily loaded in the first empirical mode for $k = 3$, $l = 3$ in our experiments on the Pile. Shown are three text samples. Next to each $X, Y$ token sequence we show its indices in the tokeniser.
  • Figure 5: Empirical modes. We show examples of $x,y$ pairs that are heavily loaded in the first empirical mode for $k = 3$, $l = 3$ in our experiments on the Pile. Shown are three text samples. Next to each $X, Y$ token sequence we show its indices in the tokeniser.

Theorems & Definitions (58)

  • Lemma 2.1
  • proof
  • Definition 3.1
  • Definition 3.2
  • Example 3.3
  • Definition 4.1
  • Definition 4.2
  • Remark 4.3
  • Definition 4.4
  • Lemma 4.5
  • ...and 48 more