Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1

Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

TL;DR

The paper systematically evaluates OpenAI's o1 (Strawberry) LRMs on planning and scheduling benchmarks, contrasting them with traditional LLMs and highlighting both progress and persistent gaps. It demonstrates that LRMs can outperform autoregressive models on several planning tasks, but that they incur high inference costs and offer no formal guarantees. To address reliability, the authors adapt the LLM-Modulo paradigm by pairing LRMs with sound verifiers, achieving substantial performance gains and correctness guarantees across multiple hard domains. The work underscores the potential of LRMs inside verification loops while emphasizing the need for efficiency, cost-awareness, and domain-specific guarantees in real-world deployment.

Abstract

The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities, but -- despite the slew of new private and open source LLMs since GPT3 -- progress has remained slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs -- making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We see that while o1 does seem to offer significant improvements over autoregressive LLMs, this comes at a steep inference cost, while still failing to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers -- in a so-called LRM-Modulo system -- guarantees the correctness of the combined system's output while further improving performance.

Paper Structure

This paper contains 52 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: These examples are on Mystery Blocksworld. Fast Downward, a domain-independent planner (Helmert, 2006), solves all given instances near-instantly with guaranteed perfect accuracy. LLMs struggle on even the smallest instances. The two LRMs we tested, o1-preview and o1-mini, are surprisingly effective, but this performance is still not robust, and degrades quickly with length.
  • Figure 2: Extending even the (regular, not obfuscated) Blocksworld dataset to problems requiring greater numbers of steps worsens the performance of o1-preview. When tested on 110 instances that each require at least 20 steps to solve, it solves only 23.63%.
  • Figure 3: LRM-Modulo significantly improves performance over direct prompting as we increase the number of iterations.
  • Figure 4: The number of reasoning tokens used by o1-preview when solving Blocksworld instances does not track the number of nodes that need to be expanded to solve the problem.
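
The abstract and Figure 3 describe LRM-Modulo only at a high level: a model proposes candidates, an external sound verifier checks them, and performance improves with more iterations. A minimal Python sketch of such a generate-and-verify loop is given below. It is an illustration under stated assumptions, not the authors' implementation: the names `lrm_modulo`, `toy_generate`, and `toy_verify` are hypothetical, the "planning problem" is a toy string-spelling task, and the generator is a stub standing in for a call to o1.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
Verdict = Tuple[bool, str]  # (is_valid, critique text)


def lrm_modulo(
    problem: str,
    generate: Callable[[str, List[str]], Plan],
    verify: Callable[[str, Plan], Verdict],
    max_iters: int = 10,
) -> Tuple[Optional[Plan], int]:
    """Hypothetical generate-and-verify loop.

    A candidate is returned only if the verifier accepts it, so any
    returned plan is correct whenever the verifier is sound.
    """
    feedback: List[str] = []
    for iteration in range(1, max_iters + 1):
        candidate = generate(problem, feedback)
        ok, critique = verify(problem, candidate)
        if ok:
            return candidate, iteration
        feedback.append(critique)  # back-prompt with the critique
    return None, max_iters  # budget exhausted: abstain rather than guess


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_verify(problem: str, plan: Plan) -> Verdict:
    """Sound toy verifier: the plan must spell the problem string."""
    if plan == list(problem):
        return True, "valid"
    return False, f"plan {plan} does not spell {problem!r}"


def toy_generate(problem: str, feedback: List[str]) -> Plan:
    """Stub generator: improves a little with each round of feedback."""
    attempts = [["a"], ["a", "c"], ["a", "b", "c"]]
    return attempts[min(len(feedback), len(attempts) - 1)]


if __name__ == "__main__":
    print(lrm_modulo("abc", toy_generate, toy_verify, max_iters=5))
    print(lrm_modulo("abc", toy_generate, toy_verify, max_iters=2))
```

In this sketch the correctness guarantee comes entirely from the verifier: the loop never returns an unverified candidate, and it returns `None` when the iteration budget runs out. Raising `max_iters` can only increase the number of problems solved, which matches the trend Figure 3 reports, though the real system's back-prompting details may differ.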