Softmax Transformers are Turing-Complete
Hongjian Jiang, Michael Hahn, Georg Zetzsche, Anthony Widjaja Lin
TL;DR
The paper asks whether softmax-attention transformers with Chain-of-Thought (CoT) steps are Turing-complete, and shows that they are, even when restricted to length-generalizable models. The proof goes through a CoT extension of Counting RASP (C-RASP): CoT C-RASP programs can simulate Minsky counter machines, which gives Turing-completeness over unary alphabets and, more generally, for letter-bounded languages. Without Relative Positional Encodings (RPEs), CoT C-RASP is not Turing-complete for arbitrary languages; with RPEs, it is. The authors further show that languages expressible in CoT C-RASP are length-generalizably learnable within the length-generalization framework of Huang et al. Experiments on arithmetic languages agree with the theory: unary representations length-generalize with no positional encoding (NoPE), while binary representations need RPEs to generalize robustly. The work thus connects formal language theory with transformer architectures, pairing an expressivity result with a learnability guarantee for arithmetic reasoning in CoT-enabled softmax transformers.
Abstract
Hard-attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax-attention CoT transformers are Turing-complete. In this paper, we prove a stronger result: length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of Counting RASP (C-RASP), which corresponds to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show that CoT C-RASP is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex (non-linear) arithmetic reasoning.
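For readers unfamiliar with the Minsky counter machines mentioned in the TL;DR, the following is a minimal Python sketch of such a machine. It is not the paper's construction: the instruction encoding (`inc`, `jzdec`, `halt`), the function name `run_counter_machine`, and the example program `EVEN_PROGRAM` are all assumptions made for illustration. The recorded trace of (instruction index, counter values) pairs is only loosely analogous to the step-by-step record a CoT model would emit.

```python
from typing import List, Tuple

# Hypothetical instruction encoding chosen for this sketch:
#   ("inc", c, nxt)        increment counter c, then jump to instruction nxt
#   ("jzdec", c, nz, z)    if counter c is zero jump to z,
#                          otherwise decrement it and jump to nz
#   ("halt", accept)       stop and report accept (True) or reject (False)
Program = List[tuple]
Trace = List[Tuple[int, Tuple[int, ...]]]


def run_counter_machine(program: Program, counters: List[int],
                        max_steps: int = 10_000) -> Tuple[bool, Trace]:
    """Run a counter machine and return (accepted, trace of configurations)."""
    counters = list(counters)  # work on a copy so the caller's list is untouched
    pc = 0
    trace: Trace = []
    for _ in range(max_steps):
        instr = program[pc]
        trace.append((pc, tuple(counters)))  # record the configuration before the step
        op = instr[0]
        if op == "halt":
            return instr[1], trace
        if op == "inc":
            _, c, nxt = instr
            counters[c] += 1
            pc = nxt
        elif op == "jzdec":
            _, c, nz, z = instr
            if counters[c] == 0:
                pc = z
            else:
                counters[c] -= 1
                pc = nz
        else:
            raise ValueError(f"unknown instruction: {instr!r}")
    raise RuntimeError("step budget exhausted before the machine halted")


# Illustrative program: accept a unary input n (stored in counter 0) iff n is even.
EVEN_PROGRAM: Program = [
    ("jzdec", 0, 1, 2),  # 0: an even number of decrements so far; zero means accept
    ("jzdec", 0, 0, 3),  # 1: an odd number of decrements so far; zero means reject
    ("halt", True),      # 2: accept
    ("halt", False),     # 3: reject
]

if __name__ == "__main__":
    for n in range(5):
        accepted, trace = run_counter_machine(EVEN_PROGRAM, [n])
        print(n, accepted, len(trace))
```

The step budget `max_steps` is a practical guard only; a counter machine in general need not halt, which is precisely what makes the model Turing-complete once two or more counters are available.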
