The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs

Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah

TL;DR

Problem: Do LLMs truly perform procedural, long-horizon reasoning, or do they rely on short-horizon pattern extrapolation?

Approach: Finite-State Machine Execution provides a minimal, interpretable benchmark. The model is given an FSM defined as $M=(Q,\Sigma,\delta,q_0)$ and must update the state step by step, reporting it in a fixed output format.

Metrics: Turn Accuracy (per-turn correctness from the previous state) and Task Accuracy (long-horizon fidelity from the initial state) separate immediate computation from state maintenance.

Findings: Accuracy declines systematically with horizon and branching. Scaling improves local rule adherence but does not eliminate long-horizon fragility. Rule retrieval under high branching is a primary bottleneck, and externalizing intermediate steps via reasoning or scratchpads can partially mitigate failures.

Significance: The FSM framework enables transparent diagnosis and can guide inductive-bias design, external memory, and modular execution strategies toward genuine long-horizon procedural competence in LLMs.

Abstract

Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning, that is, to execute multi-step, rule-based computations, remains unclear. Unlike algorithmic systems, which can deterministically execute long-horizon symbolic procedures, LLMs often degrade under extended reasoning chains, but there is no controlled, interpretable benchmark to isolate and measure this collapse. We introduce Finite-State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step by step on a sequence of input actions, maintaining state consistency over multiple turns. This task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct probe of the model's internal procedural fidelity. We measure both Turn Accuracy and Task Accuracy to disentangle immediate computation from cumulative state maintenance. Empirical results reveal systematic degradation as task horizon or branching complexity increases. Models perform significantly worse when rule retrieval involves high branching factors than when memory span is long. Larger models show improved local accuracy but remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. FSM-based evaluation offers a transparent, complexity-controlled probe for diagnosing this failure mode and guiding the design of inductive biases that enable genuine long-horizon procedural competence. By grounding reasoning in measurable execution fidelity rather than surface correctness, this work helps establish a rigorous experimental foundation for understanding and improving the algorithmic reliability of LLMs.
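
To make the task and the two metrics concrete, the following Python sketch executes a small FSM and scores a hypothetical model transcript. The abstract does not give formal metric definitions, so this is one plausible reading rather than the authors' implementation: Turn Accuracy checks each predicted state against the transition rule applied to the model's own previous state, and Task Accuracy checks each predicted state against the ground-truth trajectory from the initial state. The FSM, the action sequence, the predictions, and all function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# Transition function delta: (current state, action) -> next state.
Delta = Dict[Tuple[State, Action], State]


def run_fsm(delta: Delta, q0: State, actions: List[Action]) -> List[State]:
    """Ground-truth execution: the state reached after each action, starting at q0."""
    states = []
    q = q0
    for a in actions:
        q = delta[(q, a)]
        states.append(q)
    return states


def turn_accuracy(delta: Delta, q0: State, actions: List[Action],
                  predicted: List[State]) -> float:
    """Assumed definition: a turn is correct if the prediction follows the
    rule from the model's OWN previous state (local rule application)."""
    prev = q0
    correct = 0
    for a, p in zip(actions, predicted):
        if delta.get((prev, a)) == p:
            correct += 1
        prev = p  # condition the next turn on what the model reported
    return correct / len(actions)


def task_accuracy(delta: Delta, q0: State, actions: List[Action],
                  predicted: List[State]) -> float:
    """Assumed definition: a turn is correct if the prediction equals the
    true state reached by executing the FSM from q0 (state maintenance)."""
    truth = run_fsm(delta, q0, actions)
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(actions)


# Hypothetical 3-state, 2-action FSM with initial state "A".
DELTA: Delta = {
    ("A", "x"): "B", ("A", "y"): "A",
    ("B", "x"): "C", ("B", "y"): "A",
    ("C", "x"): "A", ("C", "y"): "B",
}
ACTIONS = ["x", "x", "y", "x"]
# Hypothetical model output with a single wrong transition at turn 2.
PREDICTED = ["B", "A", "A", "B"]

if __name__ == "__main__":
    print(run_fsm(DELTA, "A", ACTIONS))                   # ['B', 'C', 'B', 'C']
    print(turn_accuracy(DELTA, "A", ACTIONS, PREDICTED))  # 0.75
    print(task_accuracy(DELTA, "A", ACTIONS, PREDICTED))  # 0.25
```

In this toy transcript the model misapplies one rule at turn 2 and applies every later rule correctly from its own (wrong) state. Under the assumed definitions, Turn Accuracy stays at 0.75 while Task Accuracy drops to 0.25, which illustrates how a single local error can propagate through the rest of the trajectory.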

Paper Structure

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: Comparison of Task Accuracy and Turn Accuracy across different models.
  • Figure 2: Task accuracy comparison for a Wide & Shallow setup vs Deep & Narrow setup.
  • Figure 3: Increasing the step size to 2 results in a large performance degradation on a 4-state/5-action setup, indicating that steps should also be atomized to reach high task accuracy. Using reasoning in this setup leads to much higher performance, at the cost of the reasoning tokens generated by the model.