How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks
Wanda Hou, Leon Zhou, Hong-Ye Hu, Yubei Chen, Yi-Zhuang You, Xiao-Liang Qi
TL;DR
This paper studies how large language models perform on deterministic, repeatable tasks where a single correct output exists, introducing the Sequence Accuracy Rate ($SAR$) as a binary measure of exact sequence correctness. It presents three benchmarks (cyclic letter replacement, integer addition, and Pauli string multiplication) to quantify how $SAR$ decays with output length, and it reveals a sharp accuracy cliff beyond a characteristic length $N_*$. To explain this, the authors propose a spin-glass-inspired model based on the Sherrington–Kirkpatrick (SK) framework, in which token correctness is coupled via random interactions with parameters $\beta_0$ (intrinsic error) and $\alpha$ (correlation amplification), giving $SAR(N) = \exp[-\beta(N) N]$ with $\beta(N) = \beta_0 \alpha^{N-1}$ and $N_* \approx 1 + \log(1/\beta_0)/\log \alpha$. They further show that a divide-and-conquer strategy reduces the effective correlation by scaling $\alpha$ to $\alpha^{1/k}$, which extends the reliable length scale approximately linearly with the number of sub-tasks $k$. The work provides both a quantitative diagnostic (via a disorder-averaged SK model) and a practical mitigation (segmentation) for improving deterministic reasoning reliability in LLMs, with implications for prompting, verification, and architecture design.
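The closed-form expressions quoted above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and any values passed for $\beta_0$, $\alpha$, and $k$ are illustrative rather than fitted parameters from the paper.

```python
import math


def beta(n, beta0, alpha):
    """Effective per-token error rate: beta(N) = beta0 * alpha**(N - 1)."""
    return beta0 * alpha ** (n - 1)


def sar(n, beta0, alpha):
    """Sequence accuracy rate: SAR(N) = exp(-beta(N) * N)."""
    return math.exp(-beta(n, beta0, alpha) * n)


def n_star(beta0, alpha):
    """Characteristic length: N_* ~ 1 + log(1/beta0) / log(alpha), for alpha > 1."""
    return 1.0 + math.log(1.0 / beta0) / math.log(alpha)


def n_star_segmented(beta0, alpha, k):
    """N_* after replacing alpha by alpha**(1/k), as stated for k sub-tasks."""
    return n_star(beta0, alpha ** (1.0 / k))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not results from the paper).
    beta0, alpha = 1e-3, 1.1
    for n in (10, 40, 70, 100):
        print(n, sar(n, beta0, alpha))
    for k in (1, 2, 4):
        print(k, n_star_segmented(beta0, alpha, k))
```

Because $\log \alpha^{1/k} = (\log \alpha)/k$, the quantity $N_* - 1$ grows by exactly a factor of $k$ in this sketch, which is the approximately linear extension described above.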
Abstract
We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples include letter replacement in strings following a given rule, integer addition, and multiplication of string operators in many-body quantum mechanics. If the model performs the task through a simple repetition algorithm, the success rate should decay exponentially with sequence length. In contrast, our experiments on leading large language models reveal a sharp double-exponential drop beyond a characteristic length scale, forming an accuracy cliff that marks the transition from reliable to unstable generation. This indicates that the models fail to execute each operation independently. To explain this phenomenon, we propose a statistical-physics-inspired model that captures the competition between external conditioning from the prompt and internal interference among generated tokens. The model quantitatively reproduces the observed crossover and provides an interpretable link between attention-induced interference and sequence-level failure. Fitting the model to empirical results across multiple models and tasks yields effective parameters that characterize the intrinsic error rate and error accumulation factor for each model-task pair, offering a principled framework for understanding the limits of deterministic accuracy in large language models.
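To make the independent-operation baseline concrete, the sketch below measures the sequence accuracy rate as an exact-match fraction on a toy letter-replacement task. Everything in it is an assumption for illustration: the replacement rule (shift each lowercase letter by one, wrapping z to a) is hypothetical and not necessarily the rule used in the paper, and `noisy_solver` is a stand-in for a model that makes independent per-letter errors with probability `eps`. Under that assumption, the measured rate should track the simple exponential $(1 - \epsilon)^n$ rather than show an accuracy cliff.

```python
import random
import string

LETTERS = string.ascii_lowercase


def cyclic_replace(text):
    """Hypothetical rule: shift each lowercase letter one step, wrapping z -> a.

    Assumes `text` contains only lowercase ASCII letters.
    """
    return "".join(LETTERS[(LETTERS.index(c) + 1) % 26] for c in text)


def sequence_accuracy_rate(outputs, targets):
    """Fraction of outputs that match their target exactly (all-or-nothing)."""
    if not targets or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    return sum(o == t for o, t in zip(outputs, targets)) / len(targets)


def independent_baseline(eps, n):
    """Expected rate when each of n operations fails independently with prob eps."""
    return (1.0 - eps) ** n


def noisy_solver(text, eps, rng):
    """Toy stand-in for a model: each output letter is wrong with probability eps."""
    return "".join(c if rng.random() >= eps else "?" for c in cyclic_replace(text))


def simulate_sar(n, eps, trials=2000, seed=0):
    """Estimate the sequence accuracy rate of the toy solver at length n."""
    rng = random.Random(seed)
    inputs = ["".join(rng.choice(LETTERS) for _ in range(n)) for _ in range(trials)]
    outputs = [noisy_solver(s, eps, rng) for s in inputs]
    targets = [cyclic_replace(s) for s in inputs]
    return sequence_accuracy_rate(outputs, targets)


if __name__ == "__main__":
    eps = 0.01  # illustrative per-letter error rate, not a measured value
    for n in (10, 50, 100, 200):
        print(n, simulate_sar(n, eps), independent_baseline(eps, n))
```

A solver whose errors are independent can only produce this exponential curve, so a sharper, double-exponential drop in measured data is evidence that operations interfere with one another.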
