LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench

Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati

TL;DR

The paper assesses whether large language models (LLMs) can plan and how a new Large Reasoning Model (LRM), OpenAI's o1, performs on PlanBench. It finds that vanilla LLMs do not plan robustly: their strongest results come on the simple, original tasks, and performance collapses on obfuscated or larger problems. LRMs like o1 improve substantially on this but still fall short of robust, guaranteed performance and incur high costs. The study highlights issues of efficiency, pricing, and the lack of formal correctness guarantees for LRMs, underscoring the need for alternatives such as hybrid LLM-Modulo systems and classical planners, as well as richer evaluation tools. Overall, the work offers a snapshot of current planning capabilities and calls for more rigorous, cost-aware, and guarantee-focused evaluation of future planning systems.

Abstract

The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.

Paper Structure (20 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: These examples are on Mystery Blocksworld. Fast Downward, a domain-independent planner (Helmert, 2006), solves all given instances near-instantly with guaranteed perfect accuracy. LLMs struggle on even the smallest instances. The two LRMs we tested, o1-preview and o1-mini, are surprisingly effective, but this performance is still not robust, and degrades quickly with length.
  • Figure 2:
  • Figure 3: Extending even the (regular, not obfuscated) Blocksworld dataset to problems requiring greater numbers of steps worsens the performance of o1-preview. When tested on 110 instances which each require at least 20 steps to solve, it only manages 23.63%.
  • Figure 4: The number of reasoning tokens used by o1-preview when solving Blocksworld instances does not track the number of nodes that need to be expanded to solve the problem.
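
The captions above contrast a planner whose output is guaranteed correct with models whose plans must be checked after the fact. As a rough illustration of what such a check involves, the sketch below simulates a Blocksworld plan step by step and rejects it as soon as a precondition fails. It assumes the standard four-action Blocksworld (pick-up, put-down, stack, unstack). The state encoding and the names `step` and `validate` are hypothetical, chosen for this example, and are not taken from the paper or from PlanBench.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: `on` maps each block to what it rests on
# ("table" or another block); a held block is absent from `on`.
State = Dict[str, str]
Action = Tuple[str, ...]  # e.g. ("unstack", "a", "b") or ("pick-up", "a")


def is_clear(on: State, block: str) -> bool:
    """A block is clear when it is placed somewhere and nothing rests on it."""
    return block in on and block not in on.values()


def step(on: State, holding: Optional[str],
         action: Action) -> Optional[Tuple[State, Optional[str]]]:
    """Return the successor (on, holding), or None if a precondition fails."""
    name, args = action[0], action[1:]
    on = dict(on)  # copy so the caller's state is never mutated
    if name == "pick-up" and len(args) == 1:
        (b,) = args
        if holding is None and is_clear(on, b) and on[b] == "table":
            del on[b]
            return on, b
    elif name == "put-down" and len(args) == 1:
        (b,) = args
        if holding == b:
            on[b] = "table"
            return on, None
    elif name == "unstack" and len(args) == 2:
        b, below = args
        if (holding is None and below in on
                and is_clear(on, b) and on[b] == below):
            del on[b]
            return on, b
    elif name == "stack" and len(args) == 2:
        b, below = args
        if holding == b and is_clear(on, below):
            on[b] = below
            return on, None
    return None  # unknown action or unmet precondition


def validate(initial: State, goal: State, plan: List[Action]) -> bool:
    """True iff every action is applicable in turn and all goal facts hold."""
    on, holding = dict(initial), None
    for action in plan:
        result = step(on, holding, action)
        if result is None:
            return False
        on, holding = result
    # The goal is a partial set of "block rests on X" facts.
    return all(on.get(block) == support for block, support in goal.items())


# Usage: a rests on b; b and c are on the table; the goal is b on a.
init = {"a": "b", "b": "table", "c": "table"}
plan = [("unstack", "a", "b"), ("put-down", "a"),
        ("pick-up", "b"), ("stack", "b", "a")]
print(validate(init, {"b": "a"}, plan))               # valid plan
print(validate(init, {"b": "a"}, [("pick-up", "b")])) # b is not clear
```

A checker of this kind can only confirm or reject a candidate plan. It says nothing about how the plan was produced, which is why it can be paired with any generator, whether a classical planner or a language model.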