State space models can express n-gram languages
Vinoth Nandakumar, Qiang Qu, Peng Mi, Tongliang Liu
TL;DR
The paper develops a rigorous framework showing that state space language models can express languages defined by $n$-gram rules, and that their context window can be controlled by restricting the spectrum of the state transition matrix $A$. It proves, via a constructive encoding, that a state space language model $f^*_\mathcal{S}$ with embedding dimension $e$ and a single hidden layer of size $|\mathcal{P}|$ can be $\epsilon$-equivalent to any $n$-gram model on the target language. It also analyzes the memorization capacity of these models and bounds their context window when $A$ is nilpotent. The authors extend these ideas to recurrent neural networks, provide experimental validation on toy data, and discuss how these insights translate to Transformer architectures. By linking combinatorial $n$-gram rules to the internal representations and memory of SSMs and RNNs, the work advances interpretability and architectural design for language modeling, with practical implications for efficient, parallelizable models.
Abstract
Recent advancements in recurrent neural networks (RNNs) have reinvigorated interest in their application to natural language processing tasks. This is particularly true of the more efficient and parallelizable variants known as state space models (SSMs), which have shown competitive performance against transformer models while maintaining a lower memory footprint. While RNNs and SSMs (e.g., Mamba) have been empirically more successful than rule-based systems based on n-gram models, a rigorous theoretical explanation for this success has not yet been developed, as it is unclear how these models encode the combinatorial rules that govern the next-word prediction task. In this paper, we construct state space language models that can solve the next-word prediction task for languages generated from n-gram rules, thereby showing that SSMs are more expressive than n-gram models. Our proof shows how SSMs can encode n-gram rules using new theoretical results on their memorization capacity, and demonstrates how their context window can be controlled by restricting the spectrum of the state transition matrix. We conduct experiments with a small dataset generated from n-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.
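The link between the spectrum of the state transition matrix and the context window can be illustrated with a minimal sketch. It assumes a plain linear recurrence $h_t = A h_{t-1} + B x_t$ with no nonlinearity or gating, which is a simplification and not the paper's construction; the function names `make_shift_ssm` and `run_ssm` are hypothetical. Unrolling the recurrence gives $h_t = \sum_{j \ge 0} A^j B x_{t-j}$, so if $A$ is nilpotent with $A^k = 0$ (all eigenvalues zero), every term with $j \ge k$ vanishes and the state depends only on the last $k$ inputs.

```python
import numpy as np

def make_shift_ssm(window: int, embed_dim: int):
    """Build (A, B) for the linear recurrence h_t = A h_{t-1} + B x_t.

    Illustrative assumption: A is a block shift matrix, so the state
    holds the last `window` input embeddings and nothing older.
    """
    # Ones on the subdiagonal: nilpotent, shift**window is the zero matrix.
    shift = np.eye(window, k=-1)
    # Each application of A moves every embedding block one slot down.
    A = np.kron(shift, np.eye(embed_dim))
    # B writes the current input x_t into the first block of the state.
    B = np.zeros((window * embed_dim, embed_dim))
    B[:embed_dim, :] = np.eye(embed_dim)
    return A, B

def run_ssm(A, B, inputs):
    """Run the recurrence from a zero initial state; return the final state."""
    h = np.zeros(A.shape[0])
    for x in inputs:
        h = A @ h + B @ x
    return h

# Two sequences with different prefixes but the same last 3 embeddings.
window, embed_dim = 3, 2
A, B = make_shift_ssm(window, embed_dim)
rng = np.random.default_rng(0)
tail = rng.normal(size=(window, embed_dim))
seq_a = np.vstack([rng.normal(size=(5, embed_dim)), tail])
seq_b = np.vstack([rng.normal(size=(8, embed_dim)), tail])

# A is nilpotent of index `window`, so both runs end in the same state.
assert np.allclose(np.linalg.matrix_power(A, window), 0)
assert np.allclose(run_ssm(A, B, seq_a), run_ssm(A, B, seq_b))
```

In this sketch the nilpotency index plays the role of the context length: a window of $k$ embeddings is what a rule conditioning on the previous $k$ tokens would need, and anything older is erased from the state.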
