Limits of Deep Learning: Sequence Modeling through the Lens of Complexity Theory
Nikola Zubić, Federico Soldá, Aurelio Sulser, Davide Scaramuzza
TL;DR
This paper analyzes the fundamental limits of Structured State Space Models (SSMs) and Transformers in sequence modeling, focusing on function composition and multi-step reasoning. Framing the problem in complexity-theoretic terms, the authors prove that one-layer SSMs require impractically large state sizes to perform function composition over large domains, and that even with Chain-of-Thought prompting the number of required steps still grows polynomially with the complexity of the composition. Multi-layer SSMs are bounded by logarithmic-space constraints, which places them in the class $\mathbf{L}$ and aligns with limitations known for Transformers. The authors further show that finite-precision SSMs can only recognize regular languages, reinforcing inherent restrictions on memory and expressivity. Empirically, the paper demonstrates significant degradation on composition tasks across multiple models and prompting regimes, including GPT-4o, and reveals systematic error propagation and shortcut behaviors that hinder reliable multi-step reasoning. Together, the theoretical and empirical results argue for novel architectures or hybrid approaches (e.g., neuro-symbolic or external-memory mechanisms) to overcome current computational barriers on the path toward more general artificial intelligence.
Abstract
Despite their successes, deep learning models struggle with tasks requiring complex reasoning and function composition. We present a theoretical and empirical investigation into the limitations of Structured State Space Models (SSMs) and Transformers in such tasks. We prove that one-layer SSMs cannot efficiently perform function composition over large domains without impractically large state sizes, and that even with Chain-of-Thought prompting, they require a number of steps that scales unfavorably with the complexity of the function composition. We also show that the language recognized by a finite-precision SSM lies within the class of regular languages. Our experiments corroborate these theoretical findings. Evaluating models on tasks including various function composition settings, multi-digit multiplication, dynamic programming, and Einstein's puzzle, we find significant performance degradation even with advanced prompting techniques. Models often resort to shortcuts, leading to compounding errors. These findings highlight fundamental barriers within current deep learning architectures rooted in their computational capacities. We underscore the need for innovative solutions to transcend these constraints and achieve reliable multi-step reasoning and compositional task-solving, which is critical for advancing toward general artificial intelligence.
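The intuition behind the regular-language claim can be sketched with a toy example. The Python below is an illustrative assumption, not the paper's construction or proof: it takes a hypothetical scalar recurrence h' = quantize(a*h + b*x), where `quantize`, `levels`, `a`, and `b` are invented for this sketch, and enumerates the reachable hidden states. Because a finite-precision state can take only finitely many values, the recurrence collapses into a finite transition table, which is the transition function of a deterministic finite automaton.

```python
def quantize(value, levels):
    """Round value to the nearest representable level (hypothetical finite precision)."""
    return min(levels, key=lambda q: abs(q - value))


def ssm_to_dfa(a, b, levels, alphabet, h0=0.0):
    """Enumerate reachable states of the toy recurrence h' = quantize(a*h + b*x).

    Returns the start state, the set of reachable states, and a transition
    table keyed by (state, input symbol).
    """
    start = quantize(h0, levels)
    transitions = {}
    seen = {start}
    frontier = [start]
    while frontier:
        h = frontier.pop()
        for x in alphabet:
            nxt = quantize(a * h + b * x, levels)
            transitions[(h, x)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return start, seen, transitions


def run_ssm(a, b, levels, inputs, h0=0.0):
    """Run the quantized recurrence directly on an input sequence."""
    h = quantize(h0, levels)
    for x in inputs:
        h = quantize(a * h + b * x, levels)
    return h


def run_table(start, transitions, inputs):
    """Run the extracted finite transition table on the same sequence."""
    h = start
    for x in inputs:
        h = transitions[(h, x)]
    return h


# Assumed toy setting: 17 representable values in [-2, 2], binary inputs.
levels = [i / 4 for i in range(-8, 9)]
start, states, delta = ssm_to_dfa(0.5, 1.0, levels, (0, 1))
print(len(states), "reachable states out of", len(levels), "representable values")
```

Under these assumptions, any accept/reject readout that depends only on the final state (for example, a threshold on h) picks out a subset of accepting states, so the recognized language is regular. The state count is bounded by the number of representable values regardless of input length.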
