Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo Sabatini

TL;DR

Echoing recent theoretical studies, this work characterizes the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. It also derives an approximation error bound and shows that the error diminishes exponentially as depth increases.

Abstract

Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we use a Lie-algebraic control perspective to examine the bounds on error, and how error scales, when models operate outside their expressivity regimes. Our theory formulates a correspondence between the depth of a sequence model and a tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.

Paper Structure

This paper contains 59 sections, 7 theorems, 88 equations, 4 figures, 3 tables.

Key Result

Lemma 3.1

No abelian $\hat{\mathbf{S}}$ can simulate a general SSM.
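
One way to build intuition for this lemma is the following minimal sketch. It is not the paper's proof, and it assumes (as a reading of the statement, not something confirmed by the text here) that "abelian" means the state-transition matrices commute pairwise. Under that assumption, the final state cannot depend on the order of the inputs, whereas a general system with non-commuting transitions can distinguish orderings. The helper `run` and the specific matrices are hypothetical and chosen only for illustration.

```python
import numpy as np

def run(transitions, h0):
    """Apply a sequence of linear state transitions to h0, first to last."""
    h = h0
    for M in transitions:
        h = M @ h
    return h

h0 = np.array([1.0, 2.0])

# Assumed abelian case: diagonal matrices always commute with each other.
D1 = np.diag([2.0, 0.5])
D2 = np.diag([-1.0, 3.0])

# Non-commuting case: a coordinate swap and a sign flip.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
S = np.diag([1.0, -1.0])

# Commuting transitions reach the same state regardless of input order.
abelian_order_invariant = np.allclose(run([D1, D2], h0), run([D2, D1], h0))

# Non-commuting transitions reach different states for different orders.
general_order_invariant = np.allclose(run([P, S], h0), run([S, P], h0))

print("abelian order-invariant:", abelian_order_invariant)
print("general order-invariant:", general_order_invariant)
```

In this toy setting, any task whose answer depends on input order is out of reach for the commuting system, which is the flavor of limitation the lemma formalizes.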

Figures (4)

  • Figure 1: Geometric intuition of Lie theory. Starting at point $e$, apply actions $\mathbf{A}$ and $\mathbf{B}$ in sequence, then undo them with $\mathbf{B}^{-1}$ followed by $\mathbf{A}^{-1}$. The composition $\mathbf{A}\mathbf{B}\mathbf{B}^{-1}\mathbf{A}^{-1}$ returns to $e$. However, switching the order of undoing, as in $\mathbf{A}\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^{-1}$, might not return to $e$, instead landing on a different point $e'$. Lie theory provides a useful measure of the potential offset between $e$ and $e'$.
  • Figure 2: Maximum sequence length at which each model, with a varying number of layers, achieves $>90\%$ sequence-level prediction accuracy on the training set of the word problem $A_5$. All models are trained on lengths up to 128. Deep models ($>4$ layers) that failed to achieve a longer sequence length than shallower models are not shown; deep GLA and signed Mamba models often fail to learn this task. In the legend entry $\bigl(\lceil \log_2 T \rceil + 1\bigr)$, $T$ is the sequence length (i.e., the y-axis).
  • Figure 3: Mean squared error on the rotated vector prediction task at various sequence lengths for Transformer, GLA, and signed Mamba, with different numbers of layers. The left column shows results on the training sets, and the right column shows results on the test sets. Standard error is computed over 3 seeds. Performance of DeltaProduct with 4 Householder products is shown as a reference.
  • Figure 4: Mean squared error on the rotated vector prediction task at various sequence lengths for AUSSM, with different numbers of layers, on the train (left) and test (right) sets. Standard error is computed over 3 seeds. Performance of DeltaProduct with 4 Householder products is shown as a reference.
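
The geometric picture in Figure 1 can be checked numerically. The sketch below is an illustration under assumptions not taken from the paper: it picks 3D rotations about the x- and z-axes as the actions $\mathbf{A}$ and $\mathbf{B}$, and measures the offset from the identity with the Frobenius norm. The function names `rot_x`, `rot_z`, and `group_commutator` are hypothetical helpers.

```python
import numpy as np

def rot_x(theta):
    """Rotation by theta radians about the x-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def group_commutator(A, B):
    """Compute A B A^{-1} B^{-1}, i.e. undoing in the swapped order."""
    return A @ B @ np.linalg.inv(A) @ np.linalg.inv(B)

theta = 0.1
A, B = rot_x(theta), rot_z(theta)
I = np.eye(3)

# Undoing in reverse order: A B B^{-1} A^{-1} returns to the identity.
undo_reversed = A @ B @ np.linalg.inv(B) @ np.linalg.inv(A)
offset_reversed = np.linalg.norm(undo_reversed - I)

# Undoing in swapped order: A B A^{-1} B^{-1} generally does not.
offset_swapped = np.linalg.norm(group_commutator(A, B) - I)

# Rotations about the same axis commute, so their offset vanishes.
offset_same_axis = np.linalg.norm(group_commutator(rot_z(0.1), rot_z(0.3)) - I)

print(offset_reversed, offset_swapped, offset_same_axis)
```

For small angles the swapped-order offset shrinks roughly with the product of the two angles, which is the sense in which the Lie bracket captures the leading-order discrepancy between $e$ and $e'$.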

Theorems & Definitions (37)

  • Definition 2.1: State space model
  • Definition 2.2: State-transition matrix
  • Definition 2.3: Controlled Lie equation
  • Definition 2.4: Lift
  • Definition 2.5: Commutator mass
  • Definition 2.6: Deep SSM
  • Lemma 3.1
  • Theorem 3.2
  • Proof sketch
  • Proposition 3.3
  • ...and 27 more