Analysis of Optimality of Large Language Models on Planning Problems

Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel

Abstract

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain, in which towers of labeled blocks must be rearranged from an initial configuration into a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Whereas classical search algorithms collapse as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens, and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.
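To make the experimental manipulations concrete, the sketch below generates an instance parameterized by depth (tower height), width (number of towers), and compositionality (number of goal blocks). It is a minimal illustration only, not the paper's generator; the function name, block labels, and random goal selection are our assumptions.

```python
import random

def make_blocksworld_instance(height, width, n_goal, seed=0):
    """Build `width` towers of `height` blocks each (lists ordered
    bottom-up) and a goal tower of `n_goal` blocks drawn across towers.
    Illustrative only; not the instance generator used in the paper."""
    rng = random.Random(seed)
    labels = [f"b{i}" for i in range(height * width)]
    rng.shuffle(labels)
    # Initial state: `width` disjoint towers standing on the table.
    towers = [labels[t * height:(t + 1) * height] for t in range(width)]
    # Goal: a single target tower built from blocks scattered across towers.
    goal = rng.sample(labels, n_goal)
    return towers, goal

towers, goal = make_blocksworld_instance(height=4, width=3, n_goal=5)
```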


Figures (5)

  • Figure 1: Structural Isomorphism between Blocksworld and $P^*$ Topology. (a) A standard Blocksworld instance illustrating the initial state (left) and the target goal state (right). The task requires rearranging blocks using four standard atomic actions: unstack, stack, pick-up, and put-down. (b) The corresponding $P^*$ graph representation of the initial state. The table serves as the central root node, and each stack of blocks forms a disjoint branch, framing the planning task as a structural path-traversal problem.
  • Figure 2: Scaling Goal Blocks and Tower Height. Gemini 3 Pro vs. LAMA-2011 and the theoretical optimal cost. (a) Increasing goal blocks: plan costs illustrating generalization to retrieving and stacking many goal blocks. (b) Increasing tower height: plan costs for retrieving and unstacking blocks from tall towers.
  • Figure 3: Scalability and Compute Allocation. (a) Grand Challenge: plan steps vs. total problem volume ($h \times w \times s$). Gemini 3 Pro tracks optimality into extreme complexities where classical search collapses. (b) Inference-time compute: thinking trace length (tokens) vs. optimal cost $C_{opt}$. All curricula follow a linear trend ($\approx 47$ tokens/step); failed instances (gray crosses) cluster at higher token counts.
  • Figure 4: Grand Challenge Example. The simplest problem from the Grand Challenge curriculum with initial state (left) and goal state (right).
  • Figure 5: Interleaved Harvest Example (6 towers, 12 goal blocks). Initial state (left) with goal blocks highlighted in gold, and goal tower (right). The scrambled PDDL encoding and cross-tower dependencies make this problem significantly harder for classical planners than the single-goal Harvest.
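The structural isomorphism in Figure 1(b) reduces to a small transformation: the table becomes the central root node and each tower a disjoint path branch. The following minimal sketch makes that mapping explicit; the function `to_path_star` and the adjacency-list representation are our own illustration, not the paper's implementation.

```python
def to_path_star(towers):
    """Map a Blocksworld state to its $P^*$ graph: the table is the
    central root and each tower becomes a disjoint path branch, as in
    Figure 1(b). Towers are lists ordered bottom-up."""
    edges = {"table": []}
    for tower in towers:
        prev = "table"
        for block in tower:
            edges.setdefault(prev, []).append(block)
            edges[block] = []
            prev = block
    return edges

# Two towers [a, b, c] and [d, e] become two branches of the star:
# {'table': ['a', 'd'], 'a': ['b'], 'b': ['c'], 'c': [], 'd': ['e'], 'e': []}
print(to_path_star([["a", "b", "c"], ["d", "e"]]))
```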