Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman
TL;DR
We address the problem of understanding why LLMs fail on simple tasks by proposing a seven-tape deterministic Turing-machine formulation of the LLM inference pipeline, in which input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text each occupy a distinct tape. This mechanistic model enables precise error localisation across tokenisation, forward computation (including attention), and detokenisation, and it reframes chain-of-thought prompting as externalising intermediate computation on the output tape. Key contributions include a phase-based architecture, phase-specific transition rules, and demonstrative analyses of tokenisation-induced counting errors and centre-embedding failures, grounded in computability concepts. The framework clarifies the strengths and limits of prompting strategies and attention-based computation, offers testable predictions about when and why certain failures occur, and guides principled improvements to prompting and model design. Overall, the multi-tape TM provides a rigorous, interpretable lens that complements empirical scaling laws by linking observed behaviours to formal computational constraints.
Abstract
Large language models (LLMs) exhibit systematic failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction as a deterministic multi-tape Turing machine in which each of seven tapes represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, for example, how tokenisation obscures the character-level structure needed for counting tasks. It also clarifies why techniques such as chain-of-thought prompting help, namely by externalising computation on the output tape, while exposing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
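
To make the tokenisation claim concrete, the following is a minimal sketch, not the paper's formal construction. It assumes a hypothetical toy vocabulary (`TOY_VOCAB`), a hypothetical `Tapes` record with one field per tape named in the abstract, and a greedy longest-match tokeniser standing in for a real subword tokeniser. It illustrates how a letter that is present on the input-character tape can be absent from the token tape that the forward computation reads.

```python
from dataclasses import dataclass, field

# Hypothetical toy vocabulary (an assumption for illustration only).
# Real tokenisers learn far larger subword vocabularies.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


@dataclass
class Tapes:
    """One field per tape listed in the abstract (names are illustrative)."""
    input_chars: str = ""                                 # tape 1: input characters
    tokens: list = field(default_factory=list)            # tape 2: token ids
    vocabulary: dict = field(default_factory=lambda: dict(TOY_VOCAB))  # tape 3
    parameters: list = field(default_factory=list)        # tape 4: model parameters
    activations: list = field(default_factory=list)       # tape 5: activations
    probabilities: list = field(default_factory=list)     # tape 6: distributions
    output_text: str = ""                                 # tape 7: output text


def tokenise(tapes: Tapes) -> list:
    """Tokenisation phase: read the character tape, write the token tape.

    Uses greedy longest-match, a simplification assumed for this sketch.
    """
    text = tapes.input_chars
    max_len = max(len(piece) for piece in tapes.vocabulary)
    tapes.tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in tapes.vocabulary:
                tapes.tokens.append(tapes.vocabulary[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tapes.tokens


tapes = Tapes(input_chars="strawberry")
tokenise(tapes)

# Count of the letter on the character tape versus its id on the token tape.
chars_r = tapes.input_chars.count("r")
token_r = tapes.tokens.count(TOY_VOCAB["r"])
print(tapes.tokens, chars_r, token_r)
```

In this sketch the character tape holds three occurrences of "r", while the token tape holds only the ids for "straw" and "berry", so any later phase that reads the token tape alone has no direct access to the letter count.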
