State space models can express n-gram languages

Vinoth Nandakumar, Qiang Qu, Peng Mi, Tongliang Liu

TL;DR

The paper develops a rigorous framework showing that state space language models can express languages defined by $n$-gram rules, and that their context window can be controlled by restricting the spectrum of the state transition matrix $A$. It proves, via a constructive encoding, that a state space language model $f^*_\mathcal{S}$ with embedding dimension $e$ and a single hidden layer of size $|\mathcal{P}|$ can be $\epsilon$-equivalent to any $n$-gram model on the target language. It also analyzes the memorization capacity of these models and bounds their context window by taking $A$ to be nilpotent. The authors extend these ideas to recurrent neural networks, validate them experimentally on toy data, and discuss how the insights translate to Transformer architectures. By linking combinatorial $n$-gram rules to the internal representations and memory of SSMs and RNNs, the work informs both the interpretability and the architectural design of efficient, parallelizable language models.

Abstract

Recent advancements in recurrent neural networks (RNNs) have reinvigorated interest in their application to natural language processing tasks, particularly with the development of more efficient and parallelizable variants known as state space models (SSMs), which have shown competitive performance against transformer models while maintaining a lower memory footprint. While RNNs and SSMs (e.g., Mamba) have been empirically more successful than rule-based systems based on n-gram models, a rigorous theoretical explanation for this success has not yet been developed, as it is unclear how these models encode the combinatorial rules that govern the next-word prediction task. In this paper, we construct state space language models that can solve the next-word prediction task for languages generated from n-gram rules, thereby showing that the former are more expressive. Our proof shows how SSMs can encode n-gram rules using new theoretical results on their memorization capacity, and demonstrates how their context window can be controlled by restricting the spectrum of the state transition matrix. We conduct experiments with a small dataset generated from n-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.
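
To make "languages generated from n-gram rules" concrete, the following is a minimal sketch of a count-based $n$-gram next-word model. It is an illustration only, not the paper's construction: the function name `fit_ngram`, the choice $n = 4$, and the whitespace tokenization are assumptions, and the two training sentences are the examples quoted in the Figure 2 caption with punctuation removed.

```python
from collections import Counter, defaultdict


def fit_ngram(sentences, n):
    """Map each (n-1)-word context to a next-word probability table.

    Hypothetical helper for illustration; the paper's datum
    (P, f_ng) is defined formally and may differ in detail.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            next_word = words[i + n - 1]
            counts[context][next_word] += 1
    # Normalize raw counts into probabilities per context.
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {w: c / total for w, c in counter.items()}
    return model


sentences = [
    "Ron woke at seven o'clock and was too upset to go back to bed",
    "Sirius woke at five o'clock and was too elated to go back to sleep",
]
model = fit_ngram(sentences, n=4)

# Both sentences share the context "go back to", so the model
# splits the probability mass between the two continuations.
print(model[("go", "back", "to")])
```

Contexts seen in only one sentence, such as `("woke", "at", "seven")`, get a single continuation with probability 1. A rule-based model of this kind is the baseline that the state space construction is shown to match.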

Paper Structure

This paper contains 26 sections, 10 theorems, 24 equations, and 4 figures.

Key Result

Theorem 4.1

Let $\mathcal{L}$ be a language over a vocabulary $\mathcal{W}$. Let $f^*_{ng}$ be an $n$-gram language model, obtained from the datum $(\mathcal{P}, f_{ng})$. Given any $\epsilon > 0$, there exists a state space language model $f^*_{\mathcal{S}}$ with the following properties:

Figures (4)

  • Figure 1: This diagram illustrates our framework for encoding $n$-gram rules with state space models. For example, the blue dots in the hidden state correspond to input sequences ending in “… go back to”. The hidden state embedding vectors have been projected to two-dimensional space, and the vectors corresponding to the same $n$-gram form clusters. The output logits represent the next-word probabilities, and here the non-zero values correspond to the words that can occur next in the sequence (see Figure 2 for more details about this example of an $n$-gram model, and Figure 4 for a more detailed cluster plot).
  • Figure 2: An example of a language $\mathcal{L}$ that can be modelled by $n$-grams, based on text from J.K. Rowling's "Harry Potter". The graph illustrates how sentences are formed, with each of the symbols A, B, C, D being replaced by one of the words in the corresponding box. Two examples of sentences from this language are "Ron woke at seven o’clock and was too upset to go back to bed." and "Sirius woke at five o’clock and was too elated to go back to sleep." See Appendix B for an expanded example, including a full list of $n$-gram rules.
  • Figure 3: This figure shows how the context window of a state space model can be bounded using a nilpotent state transition matrix $A$. In this example, $A^3 = 0$, as shown by the arrows, which represent the action of $A$ on the individual neurons: each neuron is mapped to a neuron with a larger index. These neurons represent the Jordan basis for the nilpotent matrix, and the arrows indicate that inputs prior to $x_{T-2}$ do not influence the output $y_T$.
  • Figure 4: Plot showing the eigenvalues of the matrix $A$, and hidden state embeddings corresponding to $n$-grams.
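
The bounded context window described in the Figure 3 caption can be checked numerically. The sketch below assumes a purely linear recurrence $h_t = A h_{t-1} + B x_t$ with read-out $y_t = C h_t$ and scalar inputs; the paper's models may include additional layers, and the vectors `B` and `C` are arbitrary illustrative values. With a $3 \times 3$ shift matrix, $A^3 = 0$, so $h_T = \sum_{k=0}^{2} A^k B x_{T-k}$ and only the last three inputs can affect $y_T$.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*N))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in M]


# Nilpotent shift matrix: neuron i feeds neuron i + 1, so A^3 = 0.
A = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [1.0, 0.0, 0.0]   # assumed: each input is written into neuron 0
C = [0.5, -1.0, 2.0]  # assumed: arbitrary read-out weights


def run_ssm(xs):
    """Return y_T for the linear recurrence h_t = A h_{t-1} + B x_t."""
    h = [0.0, 0.0, 0.0]
    for x in xs:
        Ah = matvec(A, h)
        h = [a + b * x for a, b in zip(Ah, B)]
    return sum(c * hi for c, hi in zip(C, h))


A_cubed = matmul(A, matmul(A, A))

seq_a = [3.0, -1.0, 4.0, 1.0, 5.0]
seq_b = [9.0, 2.0, 4.0, 1.0, 5.0]  # differs only before x_{T-2}
seq_c = [3.0, -1.0, 7.0, 1.0, 5.0]  # differs at x_{T-2}

print(A_cubed)                        # all-zero matrix
print(run_ssm(seq_a), run_ssm(seq_b))  # identical outputs
print(run_ssm(seq_c))                  # output changes
```

Changing inputs older than $x_{T-2}$ leaves the output untouched, while changing $x_{T-2}$ itself does not, which matches a context window of length 3.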

Theorems & Definitions (31)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Definition 3.6
  • Definition 3.7
  • Definition 3.8
  • Theorem 4.1
  • Definition 4.2
  • ...and 21 more