Why Are Linear RNNs More Parallelizable?

William Merrill; Hongjian Jiang; Yanhong Li; Anthony Lin; Ashish Sabharwal

TL;DR

The theory identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete).

Abstract

The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
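To make the parallelizability claim concrete, here is a minimal sketch of why a linear recurrence admits a log-depth evaluation. It is an illustration under simplifying assumptions, not the paper's construction. It assumes a scalar recurrence $h_t = a_t h_{t-1} + b_t$ with integer coefficients. Each step is an affine map, and composing affine maps is associative, so the steps can be combined pairwise in a balanced tree. The helper names (`compose`, `sequential_state`, `tree_state`) are hypothetical.

```python
from math import ceil, log2

def compose(f, g):
    """Return the affine map 'apply f, then g' for scalar maps h -> a*h + b."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def sequential_state(steps, h0):
    """Reference evaluation: one step at a time, n sequential updates."""
    h = h0
    for a, b in steps:
        h = a * h + b
    return h

def tree_state(steps, h0):
    """Combine all steps by balanced pairwise composition.

    Returns the final state and the number of rounds used. Every
    composition within a round is independent of the others, so a
    round could run in parallel. Assumes `steps` is non-empty.
    """
    level = list(steps)
    depth = 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        depth += 1
    a, b = level[0]
    return a * h0 + b, depth

# Illustrative coefficients (a_t, b_t); the values are arbitrary.
steps = [(2, 1), (3, -1), (1, 4), (-2, 0), (5, 2), (1, 1), (2, -3)]
seq = sequential_state(steps, 1)
par, depth = tree_state(steps, 1)
print(seq == par, depth)
```

A nonlinear update such as $h_t = \tanh(a_t h_{t-1} + b_t)$ has no comparably compact representation for the composition of two steps, so this pairwise combination does not carry over directly.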

Paper Structure (76 sections, 47 theorems, 105 equations, 2 figures, 1 table)

Introduction
Preliminaries
Datatypes
Nonlinear RNNs
Linear RNNs
Multilayer RNNs
Precision
Circuit Complexity
Parallelizability Limits of Nonlinear RNNs
Poly-Precision Nonlinear RNNs Are $\mathsf{P}$-Complete
Log-Precision Nonlinear RNNs Are $\mathsf{L}$-Complete
Simulating LRNNs with Parallel Circuits
Expressivity Differences Between LRNNs
DPLR LRNNs are $\mathsf{PNC}^1$-Complete
PD LRNNs are $\mathsf{NC}^1$-Complete
...and 61 more sections

Key Result

Lemma 1

If $f : \mathbb{Q}^k \to \mathbb{Q}$ has an arithmetic closed form, then $f$ can be computed in $\mathsf{FO}$-uniform $\mathsf{TC}^0$.

Figures (2)

Figure 1: Main results, summarized as a hierarchy of increasingly expressive RNN classes with popular models within each class. Each RNN shown is "tight" for the respective complexity class $\mathsf C$ ($\mathsf{PNC}^1$, $\mathsf{L}$, etc.) in the sense that it both falls in $\mathsf C$ and can solve a $\mathsf C$-complete problem. LRNNs are in $\mathsf{PNC}^1$, implying they can be parallelized nearly as efficiently as transformers, incurring only a small $O(\log^*(n))$ depth overhead in terms of bounded fan-in boolean circuits. Nonlinear RNNs, in contrast, can solve $\mathsf{L}$-complete and even $\mathsf{P}$-complete problems, but this comes at the cost of being less parallelizable, requiring notably deeper circuits. The bottom row lists the classes of automata that each RNN model class can simulate, where WFA stands for weighted finite automaton and DWFA stands for deterministic WFA.
Figure 2: Accuracy across size/length ranges. (a) Deterministic graph connectivity over graph-size ranges. (b) Iterated $3\times 3$ matrix multiplication over $\mathbb{Z}_m$. (c) Iterated $3\times 3$ matrix multiplication over $\mathbb{Z}$. Models are evaluated on in-distribution ranges $[1,100]$ and out-of-distribution ranges $[101,200]$ and $[201,300]$.
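
As a rough illustration of how tasks (b) and (c) differ, the sketch below computes an iterated $3\times 3$ matrix product both over $\mathbb{Z}_m$ and over $\mathbb{Z}$. The modulus `m = 5`, the sequence length of 100, the entry range, and the helper names are all assumptions made for this example; the summary does not specify how the evaluation data were generated.

```python
import random

def matmul3(A, B, m=None):
    """Multiply two 3x3 integer matrices, reducing mod m when m is given."""
    C = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    if m is not None:
        C = [[x % m for x in row] for row in C]
    return C

def iterated_product(mats, m=None):
    """Left-to-right product M_1 M_2 ... M_n, starting from the identity."""
    P = [[int(i == j) for j in range(3)] for i in range(3)]
    for M in mats:
        P = matmul3(P, M, m)
    return P

rng = random.Random(0)  # fixed seed so the run is repeatable
m = 5                   # hypothetical modulus
mats = [[[rng.randrange(m) for _ in range(3)] for _ in range(3)]
        for _ in range(100)]

prod_mod = iterated_product(mats, m)  # entries stay in [0, m)
prod_int = iterated_product(mats)     # entries are unbounded integers
max_bits = max(abs(x).bit_length() for row in prod_int for x in row)
print(prod_mod, max_bits)
```

Over $\mathbb{Z}_m$ every intermediate entry fits in a fixed number of bits, whereas over $\mathbb{Z}$ the bit length of the entries can grow with the sequence length, which `max_bits` reports for this sample.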

Theorems & Definitions (94)

Definition 1
Definition 2: Arithmetic Closed Form
Lemma 1: chiang2025transformers
Definition 3: Nonlinear RNN
Definition 4: LRNN
Definition 5: DPLR
Definition 6: PD; terzic2025structured
Definition 7: Multihead RNN Sublayer
Definition 8: Feedforward Sublayer
Definition 9: Language Recognition
...and 84 more
