On the Limits of Innate Planning in Large Language Models

Charles Schepanowski, Charles Ling

TL;DR

The paper benchmarks innate planning and state tracking in large language models, using the 8-puzzle to isolate planning from tool use. It compares four models under Zero-Shot, Chain-of-Thought, and Algorithm-of-Thought prompts, then adds tiered corrective feedback and, separately, an external move validator. The models show persistent deficits in maintaining correct state representations and in forming effective long-horizon strategies. Feedback raises success rates for some model-prompt combinations, but at substantial compute cost, and even when the validator supplies only valid moves, no model solves any puzzle. The study argues that robust autonomous planning likely requires explicit state maintenance and structured search beyond prompting alone, which matters for deploying LLMs in real-world sequential tasks.

Abstract

Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.
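
To make the abstract's terms concrete, the following is a minimal sketch of what step-by-step evaluation, a move validator, and "distance to the goal state" could look like for the 8-puzzle. It is not the authors' code. The board encoding (a 9-tuple in row-major order with 0 as the blank), the goal layout, the move names (the direction the blank slides), and the names valid_moves, apply_move, evaluate, and manhattan are all assumptions made for illustration. Manhattan distance is used here as one common distance measure; the abstract does not say which one the paper uses.

```python
from typing import Optional

# Assumed encoding: 9 entries in row-major order, 0 marks the blank.
State = tuple[int, ...]
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout

# Assumed convention: a move names the direction the blank slides.
DELTAS = {"up": -3, "down": 3, "left": -1, "right": 1}


def valid_moves(state: State) -> list[str]:
    """Return the moves that keep the blank on the 3x3 board."""
    row, col = divmod(state.index(0), 3)
    moves = []
    if row > 0:
        moves.append("up")
    if row < 2:
        moves.append("down")
    if col > 0:
        moves.append("left")
    if col < 2:
        moves.append("right")
    return moves


def apply_move(state: State, move: str) -> Optional[State]:
    """Slide the blank one cell; return None if the move is invalid."""
    if move not in valid_moves(state):
        return None
    blank = state.index(0)
    target = blank + DELTAS[move]
    board = list(state)
    board[blank], board[target] = board[target], board[blank]
    return tuple(board)


def evaluate(start: State, moves: list[str]) -> tuple[str, State]:
    """Replay a move sequence; report the outcome and the last valid state."""
    state = start
    for move in moves:
        nxt = apply_move(state, move)
        if nxt is None:
            return "invalid_move", state  # stop at the first illegal move
        state = nxt
    return ("solved" if state == GOAL else "unsolved"), state


def manhattan(state: State) -> int:
    """Sum over tiles of row plus column offset from the goal position."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue  # the blank does not count toward the distance
        goal_idx = tile - 1
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total
```

Under these assumptions, evaluate((1, 2, 3, 4, 5, 6, 0, 7, 8), ["right", "right"]) reports a solved puzzle, while a sequence containing an off-board move stops at the last valid state. A validator in the abstract's sense would correspond to showing the model only the output of valid_moves at each step, and a move that leaves manhattan unchanged or larger is one that does not reduce the distance to the goal.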

Paper Structure

This paper contains 42 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Example of our 8-puzzle evaluation pipeline. The model receives the system prompt, in this case, our Zero-Shot prompt, as well as the puzzle. The model then returns a sequence of moves. An external function applies each move to determine whether the model solved the puzzle. We highlight the legal moves in green. In this example, the model solves the puzzle on its first attempt.
  • Figure 2: Example of our feedback pipeline. As in Fig. 1, the model first receives the Zero-Shot system prompt and the puzzle, but its initial attempt fails due to an invalid move (highlighted in red). We go back to the preceding valid state and re-prompt the model with suggestive feedback (shown in bold), after which the model successfully solves the puzzle.
  • Figure 3: Success rates for each model under different prompting strategies across difficulty bins defined by optimal A* solution length. Each bin contains ten puzzles, and the percentage solved is computed within each bin.
  • Figure 4: Termination breakdown of the four LLMs on the 8-puzzle across prompting strategies. Each stacked bar represents 100% of trials for a given model and prompting condition (Zero-Shot, CoT, or AoT). The colored segments show the percentage of trials that ended in success (Solved) or in each failure mode.
  • Figure 5: Success rates of each model under different feedback conditions. The “Original” point represents the initial success rate for each prompting strategy (Zero-Shot, CoT, and AoT). The “Repeat,” “Specific,” and “Suggestive” points show the final success rates after models were given up to three additional attempts on previously failed puzzles with the corresponding type of feedback. Each line tracks the performance of a prompting strategy across feedback conditions.
  • ...and 2 more figures
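
The Figure 3 caption groups puzzles by optimal A* solution length. As a rough illustration of how such a length could be computed, here is a minimal A* sketch with a Manhattan-distance heuristic. The heuristic, the goal layout, the board encoding (0 as the blank), and the function names are assumptions for this example; the captions do not specify the authors' implementation or the bin boundaries.

```python
import heapq

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout; 0 is the blank


def neighbors(state):
    """Yield every state reachable by sliding the blank one cell."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            target = r * 3 + c
            board = list(state)
            board[blank], board[target] = board[target], board[blank]
            yield tuple(board)


def manhattan(state):
    """Admissible heuristic: total row plus column offset of each tile."""
    total = 0
    for idx, tile in enumerate(state):
        if tile:
            goal_idx = tile - 1
            total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total


def optimal_length(start):
    """Return the optimal number of moves via A*, or None if unsolvable."""
    best = {start: 0}  # cheapest known cost to reach each state
    frontier = [(manhattan(start), 0, start)]  # entries are (f, g, state)
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == GOAL:
            return g
        if g > best[state]:
            continue  # stale queue entry; a cheaper path was found later
        for nxt in neighbors(state):
            if g + 1 < best.get(nxt, float("inf")):
                best[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + manhattan(nxt), g + 1, nxt))
    return None
```

Because Manhattan distance never overestimates the remaining moves, the first time the goal is popped from the queue its cost is optimal. A difficulty bin could then be assigned by comparing optimal_length(puzzle) against whatever length ranges the study uses.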