Why Are Linear RNNs More Parallelizable?

William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

TL;DR

The theory identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete).

Abstract

The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
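To make the parallelizability claim concrete, here is a minimal sketch of the standard observation behind it: a linear recurrence $h_t = A_t h_{t-1} + b_t$ is a chain of affine maps, and composing affine maps is associative, so the chain can be collapsed by a balanced tree with about $\log_2 n$ levels. This is an illustration under assumptions, not the paper's construction: the helper names (`compose`, `tree_reduce`, `sequential`) are hypothetical, the example uses small integer matrices, and "depth" here counts levels of matrix-level compositions rather than gates in a bounded fan-in circuit.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def compose(f, g):
    """Compose affine maps: apply f first, then g.

    Each map is a pair (A, b) standing for h -> A h + b, so
    g(f(h)) = A2 (A1 h + b1) + b2 = (A2 A1) h + (A2 b1 + b2).
    """
    (A1, b1), (A2, b2) = f, g
    return matmul(A2, A1), [x + y for x, y in zip(matvec(A2, b1), b2)]

def tree_reduce(maps):
    """Collapse a sequence of affine maps with a balanced tree.

    All compositions within one level are independent of each other,
    so they could run in parallel. Returns (composed map, levels used).
    """
    depth = 0
    while len(maps) > 1:
        nxt = [compose(maps[i], maps[i + 1])
               for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:          # odd element is carried up unchanged
            nxt.append(maps[-1])
        maps = nxt
        depth += 1
    return maps[0], depth

def sequential(maps, h0):
    """Reference: run the recurrence one step at a time."""
    h = h0
    for A, b in maps:
        h = [x + y for x, y in zip(matvec(A, h), b)]
    return h

if __name__ == "__main__":
    # Illustrative inputs only: eight steps of a 2-dimensional linear recurrence.
    steps = [([[1, t], [0, 1]], [t, 1]) for t in range(1, 9)]
    (A, b), levels = tree_reduce(steps)
    h_final = [x + y for x, y in zip(matvec(A, [0, 0]), b)]
    print(h_final, sequential(steps, [0, 0]), levels)
```

A nonlinear update such as $h_t = \tanh(A h_{t-1} + b_t)$ has no comparably compact representation for the composition of two steps, which is an informal way to see why the same trick does not directly apply to nonlinear RNNs.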

Paper Structure

This paper contains 76 sections, 47 theorems, 105 equations, 2 figures, 1 table.

Key Result

Lemma 1

If $f : \mathbb{Q}^k \to \mathbb{Q}$ has an arithmetic closed form, then $f$ can be computed in $\mathsf{FO}$-uniform $\mathsf{TC}^0$.

Figures (2)

  • Figure 1: Main results, summarized as a hierarchy of increasingly expressive RNN classes with popular models within each class. Each RNN shown is "tight" for the respective complexity class $\mathsf C$ ($\mathsf{PNC}^1$, $\mathsf{L}$, etc.) in the sense that it both falls in $\mathsf C$ and can solve a $\mathsf C$-complete problem. LRNNs are in $\mathsf{PNC}^1$, implying they can be nearly as efficiently parallelized as transformers, incurring only a small $O(\log^*(n))$ depth overhead in terms of bounded fan-in boolean circuits. Nonlinear RNNs, in contrast, can solve $\mathsf{L}$-complete and even $\mathsf{P}$-complete problems, but this comes at the cost of being less parallelizable, requiring notably deeper circuits. The bottom row lists the classes of automata that each RNN model class can simulate, where WFA stands for weighted finite automaton and DWFA stands for deterministic WFA.
  • Figure 2: Accuracy across size/length ranges. (a) Deterministic graph connectivity over graph-size ranges. (b) Iterated $3\times 3$ matrix multiplication over $\mathbb{Z}_m$. (c) Iterated $3\times 3$ matrix multiplication over $\mathbb{Z}$. Models are evaluated on in-distribution ranges $[1,100]$ and out-of-distribution ranges $[101,200]$ and $[201,300]$.
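As a rough illustration of the iterated matrix multiplication task named in Figure 2(b), the sketch below multiplies a sequence of $3\times 3$ matrices over $\mathbb{Z}_m$ both left to right and with a balanced tree, and checks that the two agree. The function names, the modulus, and the random inputs are assumptions made for this example; the excerpt does not specify how the paper encodes or samples its task instances.

```python
import random

def matmul_mod(A, B, m):
    """Product of two square integer matrices, entries reduced mod m."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % m
             for j in range(n)] for i in range(n)]

def product_sequential(mats, m):
    """Left-to-right iterated product, starting from the identity."""
    n = len(mats[0])
    acc = [[int(i == j) for j in range(n)] for i in range(n)]
    for M in mats:
        acc = matmul_mod(acc, M, m)
    return acc

def product_tree(mats, m):
    """Same product via a balanced tree; each level's products are independent."""
    level = list(mats)
    while len(level) > 1:
        nxt = [matmul_mod(level[i], level[i + 1], m)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:         # odd element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

if __name__ == "__main__":
    rng = random.Random(0)         # fixed seed, illustrative data only
    m = 5                          # hypothetical modulus
    mats = [[[rng.randrange(m) for _ in range(3)] for _ in range(3)]
            for _ in range(100)]
    print(product_tree(mats, m) == product_sequential(mats, m))
```

Because matrix multiplication is associative, the tree order gives the same result as the sequential order while needing only about $\log_2 n$ levels of products for $n$ matrices.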

Theorems & Definitions (94)

  • Definition 1
  • Definition 2: Arithmetic Closed Form
  • Lemma 1: chiang2025transformers
  • Definition 3: Nonlinear RNN
  • Definition 4: LRNN
  • Definition 5: DPLR
  • Definition 6: PD; terzic2025structured
  • Definition 7: Multihead RNN Sublayer
  • Definition 8: Feedforward Sublayer
  • Definition 9: Language Recognition
  • ...and 84 more