How Focused Are LLMs? A Quantitative Study via Repetitive Deterministic Prediction Tasks

Wanda Hou, Leon Zhou, Hong-Ye Hu, Yubei Chen, Yi-Zhuang You, Xiao-Liang Qi

TL;DR

This paper studies how large language models perform on deterministic, repeatable tasks where a single correct output exists, introducing Sequence Accuracy Rate ($SAR$) as a binary measure of exact sequence correctness. It presents three benchmarks—cyclic letter replacement, integer addition, and Pauli string multiplication—to quantify how $SAR$ decays with output length and reveals a sharp accuracy cliff beyond a characteristic length $N_*$. To explain this, the authors propose a spin-glass–inspired model based on the Sherrington–Kirkpatrick framework, where token correctness is coupled via random interactions with parameters $\beta_0$ (intrinsic error) and $\alpha$ (correlation amplification), giving $SAR(N) = \exp[-\beta(N) N]$ with $\beta(N) = \beta_0 \alpha^{N-1}$ and $N_* \approx 1 + \log(1/\beta_0)/\log \alpha$. They further show that a divide-and-conquer strategy reduces effective correlation by scaling $\alpha$ to $\alpha^{1/k}$, extending the reliable length scale approximately linearly with the number of sub-tasks $k$. The work provides both a quantitative diagnostic (via a disorder-averaged SK model) and a practical mitigation (segmentation) for improving deterministic reasoning reliability in LLMs, with implications for prompting, verification, and architecture design.
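As a rough numerical illustration of the scaling form quoted above, the following sketch evaluates $SAR(N) = \exp[-\beta_0 \alpha^{N-1} N]$ and the crossover estimate $N_* \approx 1 + \log(1/\beta_0)/\log \alpha$. The function names and the parameter values `beta0 = 1e-4`, `alpha = 1.1` are hypothetical choices for illustration, not fitted values from the paper.

```python
import math


def sar(n, beta0, alpha):
    """Model SAR(N) = exp[-beta(N) * N] with beta(N) = beta0 * alpha**(N - 1)."""
    return math.exp(-beta0 * alpha ** (n - 1) * n)


def crossover_length(beta0, alpha):
    """Estimate N_* ~ 1 + log(1 / beta0) / log(alpha), as quoted in the TL;DR."""
    return 1.0 + math.log(1.0 / beta0) / math.log(alpha)


# Hypothetical parameters, chosen only to make the cliff visible.
beta0, alpha = 1e-4, 1.1
n_star = crossover_length(beta0, alpha)
print(f"N_* ~ {n_star:.1f}")
for n in (10, 50, 90, 100, 110):
    print(n, sar(n, beta0, alpha))
```

With these assumed values the accuracy stays close to 1 for short outputs and collapses within a narrow window around $N_*$, which is the qualitative shape the summary calls the accuracy cliff.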

Abstract

We investigate the performance of large language models on repetitive deterministic prediction tasks and study how the sequence accuracy rate scales with output length. Each such task involves repeating the same operation n times. Examples include letter replacement in strings following a given rule, integer addition, and multiplication of string operators in many-body quantum mechanics. If the model performs the task through a simple repetition algorithm, the success rate should decay exponentially with sequence length. In contrast, our experiments on leading large language models reveal a sharp double-exponential drop beyond a characteristic length scale, forming an accuracy cliff that marks the transition from reliable to unstable generation. This indicates that the models fail to execute each operation independently. To explain this phenomenon, we propose a statistical-physics-inspired model that captures the competition between external conditioning from the prompt and internal interference among generated tokens. The model quantitatively reproduces the observed crossover and provides an interpretable link between attention-induced interference and sequence-level failure. Fitting the model to empirical results across multiple models and tasks yields effective parameters that characterize the intrinsic error rate and error accumulation factor for each model-task pair, offering a principled framework for understanding the limits of deterministic accuracy in large language models.
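The exponential baseline mentioned in the abstract can be checked with a small simulation. The sketch below assumes each of the n operations fails independently with a fixed probability `eps` (a hypothetical parameter), so the probability that the whole sequence is correct is $(1-\epsilon)^n$. It is a toy reference point, not the paper's experimental procedure.

```python
import random


def independent_sar(eps, n):
    """Closed-form sequence accuracy when every step errs independently."""
    return (1.0 - eps) ** n


def simulate_sar(eps, n, trials=20000, seed=0):
    """Monte Carlo estimate: fraction of trials in which all n steps are correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A step is correct when the draw falls outside the error band [0, eps).
        if all(rng.random() >= eps for _ in range(n)):
            hits += 1
    return hits / trials


eps, n = 0.02, 20  # hypothetical per-step error rate and sequence length
print(independent_sar(eps, n), simulate_sar(eps, n))
```

Under this assumption the logarithm of the accuracy is linear in n. The abstract reports a much steeper, double-exponential drop, which is why the authors conclude that the operations are not executed independently.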

Paper Structure

This paper contains 25 sections, 1 theorem, 70 equations, 9 figures.

Key Result

Theorem 1

Let the intrinsic error rate $\beta_0>0$ and the correlation amplification factor $\alpha>1$, and fix the integer $k\ge2$. Under the empirical scaling law $\mathrm{SAR}(N)=\exp[-\beta_0 N \alpha^{\,N-1}]$, for the segmentation into $k$ equal sub-tasks to yield a positive gain $\Delta(N,k)>0$, it wou… meaning that for all $N \ge N_{\mathrm{DC}}$, one has $\mathrm{SAR}_{\mathrm{DC}}(N,k) > \mathrm{SAR}(N)$ …
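The theorem statement is truncated here, so the exact definitions of $\Delta(N,k)$ and $N_{\mathrm{DC}}$ are not available. As a simplified sketch only, the code below assumes the $k$ sub-tasks each have length $N/k$, each follow the same scaling law, and succeed independently with no overhead for splitting or recombining, so that $\mathrm{SAR}_{\mathrm{DC}}(N,k) = \mathrm{SAR}(N/k)^k$. The paper's actual statement may include terms this sketch omits, and all parameter values are hypothetical.

```python
import math


def sar(n, beta0, alpha):
    """Scaling law SAR(N) = exp[-beta0 * N * alpha**(N - 1)]."""
    return math.exp(-beta0 * n * alpha ** (n - 1))


def sar_dc(n, k, beta0, alpha):
    """Assumed divide-and-conquer accuracy: k independent sub-tasks of length n / k."""
    return sar(n / k, beta0, alpha) ** k


# Hypothetical parameters for illustration.
beta0, alpha = 1e-4, 1.1
n, k = 100, 4
print("single pass:", sar(n, beta0, alpha))
print("k segments :", sar_dc(n, k, beta0, alpha))
```

Under these assumptions the exponent changes from $\beta_0 N \alpha^{N-1}$ to $\beta_0 N \alpha^{N/k-1}$, which matches the summary's description of segmentation as roughly replacing $\alpha$ by $\alpha^{1/k}$.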

Figures (9)

  • Figure 1: Illustration of the experimental setup and theoretical framework. The left panel shows examples of deterministic sequence prediction tasks (e.g., arithmetic problems) used to evaluate large language models (LLMs), where each token’s correctness is represented by a binary variable (Ising spin). Token errors are modeled as Ising spins interacting through all-to-all random couplings, capturing how correlations and noise propagate during sequence generation. The right panel displays the typical behavior of the Sequence Accuracy Rate (SAR), i.e., the probability that all tokens in an output sequence are correct, as a function of the sequence length $N$: SAR remains high for short sequences, then drops sharply beyond a characteristic crossover scale $N_*$, forming the accuracy cliff, which is reproduced by a spin-glass statistical-mechanical model.
  • Figure 2: Cyclic letter replacement benchmark on gemini-2.5-pro and gemini-2.5-flash with alphabet size $|\mathcal{A}|=4,9,13$ and $26$.
  • Figure 3: Cyclic letter replacement with alphabet size $|\mathcal{A}|=13$ across many different models.
  • Figure 4: Integer addition benchmark across different models.
  • Figure 5: Pauli string multiplication benchmark on gemini-2.5-pro and gemini-2.5-flash evaluated under strict (phase-matched) and relaxed (phase-ignored) criteria.
  • ...and 4 more figures
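To make the cyclic letter replacement benchmark and the binary SAR criterion from the captions above concrete, here is a minimal sketch. The specific rule (each letter maps to the next letter of a lowercase alphabet of size $|\mathcal{A}|$, wrapping around) and the function names are assumptions for illustration; the summary does not state the exact rule or prompt format used in the paper.

```python
import string


def cyclic_replace(text, alphabet_size, shift=1):
    """Map each letter to the one `shift` places later in a cyclic alphabet."""
    alphabet = string.ascii_lowercase[:alphabet_size]
    return "".join(
        alphabet[(alphabet.index(ch) + shift) % alphabet_size] for ch in text
    )


def sequence_correct(prediction, target):
    """Binary score: 1 only if the whole output matches the target exactly."""
    return int(prediction == target)


def sequence_accuracy_rate(predictions, targets):
    """Fraction of outputs that are exactly correct."""
    scores = [sequence_correct(p, t) for p, t in zip(predictions, targets)]
    return sum(scores) / len(scores)


target = cyclic_replace("abcd", alphabet_size=4)
print(target, sequence_correct("bcda", target), sequence_correct("bcdb", target))
```

Because the score is all-or-nothing, a single wrong token anywhere in the output counts the whole sequence as a failure, which is what makes the metric sensitive to output length.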

Theorems & Definitions (2)

  • Theorem 1: Advantage of Divide-and-Conquer
  • proof