Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

Gyuryang Heo; Timothy Ngotiaoco; Kazuki Irie; Samuel J. Gershman; Bernardo Sabatini

TL;DR

Echoing recent theoretical studies, this paper characterizes the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. It also derives an approximation error bound and shows that the error diminishes exponentially as depth increases.

Abstract

Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Using a Lie-algebraic control perspective, we examine error bounds and how error scales when models operate outside their expressivity regimes. Our theory formulates a correspondence between the depth of a sequence model and the tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that the error diminishes exponentially as depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions with experiments on symbolic word problems and continuous-valued state-tracking problems.

Paper Structure (59 sections, 7 theorems, 88 equations, 4 figures, 3 tables)

Introduction
Mathematical Preliminaries
Lie groups and Lie algebras
Classes of Lie algebras
State-space models and the Lie equation
From state-centric to flow-centric view
Restricted SSM
Simulation and error
Magnus expansion
Deep structure
Word problems
Theory
Expressivity bounds of a single layer
Expressivity bounds at deep structure
Experiments
...and 44 more sections

Key Result

Lemma 3.1

No abelian $\hat{\mathbf{S}}$ can simulate a general SSM.
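The intuition behind this lemma can be illustrated with a small numerical check. The sketch below is not the paper's proof, and the matrices are arbitrary choices made for illustration: it only shows that a product of mutually commuting (here, diagonal) transition matrices is the same for every input ordering, whereas non-commuting transitions generally produce order-dependent products. The helper names `matmul`, `product`, and `distinct_products` are hypothetical.

```python
from itertools import permutations


def matmul(X, Y):
    """Multiply two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def product(mats):
    """Left-to-right product of a sequence of square matrices."""
    n = len(mats[0])
    out = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for M in mats:
        out = matmul(out, M)
    return out


def distinct_products(mats):
    """Set of products obtained over all orderings of the same matrices."""
    return {tuple(map(tuple, product(p))) for p in permutations(mats)}


# Illustrative assumption: diagonal matrices form a commuting (abelian) family.
diagonal = [[[2, 0], [0, 3]], [[5, 0], [0, 7]], [[1, 0], [0, -1]]]
# Illustrative assumption: two shears and a quarter-turn, which do not commute.
general = [[[1, 1], [0, 1]], [[1, 0], [1, 1]], [[0, -1], [1, 0]]]

n_abelian = len(distinct_products(diagonal))
n_general = len(distinct_products(general))

print("distinct products, commuting family:    ", n_abelian)
print("distinct products, non-commuting family:", n_general)
```

In this toy setting the commuting family yields a single product regardless of order, so it cannot tell input orderings apart, while the non-commuting family yields more than one.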

Figures (4)

Figure 1: Geometric intuition of Lie theory. Starting at point $e$, apply actions $\mathbf{A}$ and $\mathbf{B}$ in sequence, then undo them with $\mathbf{B}^{-1}$ followed by $\mathbf{A}^{-1}$. The composition $\mathbf{A}\mathbf{B}\mathbf{B}^{-1}\mathbf{A}^{-1}$ returns to $e$. However, undoing in the other order, $\mathbf{A}\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^{-1}$, might not return to $e$ and can instead land on a different point $e'$. Lie theory provides a useful measure of the potential offset between $e$ and $e'$.
Figure 2: Maximum sequence length at which each model, with a varying number of layers, achieves $>90\%$ sequence-level prediction accuracy on the training set of the word problem $A_5$. All models are trained on sequences of length up to 128. Deep models ($>$ 4 layers) that failed to reach a longer sequence length than shallower models are not shown; deep GLA and signed Mamba models often fail to learn this task. In the legend entry $\bigl(\lceil \log_2 T \rceil + 1\bigr)$, $T$ is the sequence length (i.e., the y-axis).
Figure 3: Mean squared error on the rotated vector prediction task at various sequence lengths for Transformer, GLA, and signed Mamba, with different numbers of layers. The left column shows results on the training sets, and the right column shows results on the test sets. Standard error is computed over 3 seeds. Performance of DeltaProduct with 4 Householder products is shown as a reference.
Figure 4: Mean squared error on the rotated vector prediction task at various sequence lengths for AUSSM, with different numbers of layers, on the train (left) and test (right) sets. Standard error is computed over 3 seeds. Performance of DeltaProduct with 4 Householder products is shown as a reference.
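The offset between $e$ and $e'$ described in the Figure 1 caption can be checked numerically. The following is a minimal sketch under assumptions that are not taken from the paper: the actions are 3D rotations about the x- and z-axes by a small angle `t`, and the offset is measured as the Frobenius norm of $\mathbf{A}\mathbf{B}\mathbf{A}^{-1}\mathbf{B}^{-1} - \mathbf{I}$. For small `t`, this offset should scale roughly like $t^2$ times the norm of the Lie bracket of the two rotation generators, which is $\sqrt{2}$ for this pair of axes. All function names are hypothetical.

```python
import math


def matmul(X, Y):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(X):
    """Transpose; for a rotation matrix this equals the inverse."""
    return [list(row) for row in zip(*X)]


def rot_x(t):
    """Rotation by angle t about the x-axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(t):
    """Rotation by angle t about the z-axis."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def offset_from_identity(A, B):
    """Frobenius norm of A B A^{-1} B^{-1} - I for rotation matrices A, B."""
    C = matmul(matmul(A, B), matmul(transpose(A), transpose(B)))
    return math.sqrt(sum((C[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(3) for j in range(3)))


t = 0.01  # small angle, chosen for illustration
noncommuting = offset_from_identity(rot_x(t), rot_z(t))   # different axes
commuting = offset_from_identity(rot_x(t), rot_x(2 * t))  # same axis
ratio = noncommuting / t ** 2  # compare against sqrt(2)

print("offset, different axes:", noncommuting)
print("offset, same axis:     ", commuting)
print("offset / t^2:          ", ratio)
```

Rotations about the same axis commute, so their offset vanishes up to floating-point error, while rotations about different axes leave a second-order offset governed by the bracket.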

Theorems & Definitions (37)

Definition 2.1: State space model
Definition 2.2: State-transition matrix
Definition 2.3: Controlled Lie equation
Definition 2.4: Lift
Definition 2.5: Commutator mass
Definition 2.6: Deep SSM
Lemma 3.1
Theorem 3.2
Proof: Proof sketch
Proposition 3.3
...and 27 more
