What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models

Eran Hirsch; Guy Uziel; Ateret Anaby-Tavor

TL;DR

This work addresses the gap between large language models (LLMs) and classical planning. It diagnoses LLMs’ weaknesses in world modeling and action reasoning, and proposes SimPlan, a hybrid planner that couples Greedy Best-First Search with external world modeling and a ColBERT-based action-ranking heuristic. The core contribution is a generalized planning framework that maintains an explicit world model via an external tool while using a language model to score actions, enabling efficient state-space exploration. The search scores a plan $\pi = (a_1, \dots, a_n)$ by its length-normalized negative log-likelihood, $\text{Cost}(\pi) = -\frac{1}{n}\sum_{i=1}^{n} \log P_\theta(a_i \mid s_{i-1}, G)$, where $s_{i-1}$ is the state in which action $a_i$ is applied and $G$ is the set of goals. Across five domains (Blocksworld, Ferry, Grippers, Depots, Minigrid), each in simple and complex configurations, SimPlan achieves higher success rates than prior LLM-based planners, though challenges remain in Depots, the most complex domain, and with very large problem sizes. The work also introduces a generalized planning setup, data augmentation to reduce identifier bias, and ablations showing that state updates and external world modeling are necessary. These contributions advance practical planning capabilities for real-world agent systems and provide a blueprint for further improvements in hybrid planning.
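To make the cost function concrete, the following is a minimal sketch of a best-first search that orders partial plans by their mean negative log-probability. It is an illustration under stated assumptions, not SimPlan's implementation: the names `plan_cost`, `best_first_search`, and `successors`, the toy graph, and the hard-coded action probabilities (standing in for the model's $P_\theta$) are all hypothetical.

```python
import heapq
import math
from itertools import count


def plan_cost(log_probs):
    """Mean negative log-probability of the actions in a (partial) plan."""
    return -sum(log_probs) / len(log_probs)


def best_first_search(initial, is_goal, successors, max_expansions=1000):
    """Expand the partial plan with the lowest cost first.

    `successors(state)` is a hypothetical callback returning
    (action, next_state, probability) triples. In a hybrid planner the
    next state would come from an external world model and the
    probability from a learned action scorer.
    """
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(tie), initial, [], [])]
    visited = {initial}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, plan, log_probs = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt, prob in successors(state):
            if nxt in visited:
                continue
            visited.add(nxt)
            new_log_probs = log_probs + [math.log(prob)]
            heapq.heappush(
                frontier,
                (plan_cost(new_log_probs), next(tie), nxt,
                 plan + [action], new_log_probs),
            )
    return None


# Toy problem (assumed data): reach "D" from "A".
GRAPH = {
    "A": [("go_B", "B", 0.7), ("go_C", "C", 0.3)],
    "B": [("go_D", "D", 0.9)],
    "C": [("go_D", "D", 0.9)],
    "D": [],
}

plan = best_first_search("A", lambda s: s == "D", lambda s: GRAPH[s])
print(plan)
```

Dividing by the plan length keeps longer plans from being penalized merely for containing more actions, which is the apparent purpose of the $\frac{1}{n}$ factor in the cost.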

Abstract

Planning is a fundamental task in artificial intelligence that involves finding a sequence of actions that achieves a specified goal in a given environment. Large language models (LLMs) are increasingly used for applications that require planning capabilities, such as web or embodied agents. In line with recent studies, we demonstrate through experimentation that LLMs lack skills necessary for planning. Based on these observations, we advocate for the potential of a hybrid approach that combines LLMs with classical planning methodology. We then introduce SimPlan, a novel hybrid method, and evaluate its performance in a new, challenging setup. Our extensive experiments across various planning domains demonstrate that SimPlan significantly outperforms existing LLM-based planners.

Paper Structure (47 sections, 1 equation, 18 figures, 9 tables, 1 algorithm)

Introduction
Classical Planning
Language Models as World Models
Planning Domains
Experiment 1: The Actions' Effects
Experiment 2: Applicable Actions
Language Models as Planning Heuristics
Search Algorithm
External World Modeling.
Scoring Actions with Language Models
Negative examples.
Data Generation and Augmentation
Experiments
Datasets
Baselines
...and 32 more sections

Figures (18)

Figure 1: In the Ferry planning domain, a problem instance includes an initial state comprising the location of a ferry and several cars, with specified goals for placing the cars in specific locations (left). The ferry is capable of boarding a car and transporting it between locations. The planning task entails generating a sequence of actions (i.e., a plan) such that executing them leads to reaching a goal state (right).
Figure 2: The mean success rate of LLMs in inferring the new state, as a function of the number of actions.
Figure 3: Our proposed similarity-based ranking architecture. Colors green, yellow, and orange are used to denote states, goals, and actions, respectively. At each iteration, a bi-encoder is used to generate contextualized token-level representations for the concatenated current state and goals, as well as for each applicable action. The actions' representations can be extracted once in an offline process. Then, the set of applicable actions is scored based on their similarity with the state and goals representation, using the late-interaction architecture of ColBERT (Khattab and Zaharia, 2020).
Figure 4: Blocksworld domain example.
Figure 5: The PDDL Ferry domain definition. The domain definition specifies the predicates and actions, encapsulating the physics of the domain.
...and 13 more figures
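The late-interaction scoring mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration of ColBERT-style MaxSim scoring, not the paper's code: the function names, the 2-dimensional toy embeddings, and the choice to treat the state-and-goals tokens as the query side are all assumptions made for the example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its
    best match among the document token embeddings, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_actions(state_goal_vecs, action_vecs_by_name):
    """Return action names sorted from highest to lowest score.

    The action embeddings are passed in precomputed, mirroring the
    caption's remark that they can be extracted once offline.
    """
    scores = {
        name: maxsim_score(state_goal_vecs, vecs)
        for name, vecs in action_vecs_by_name.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy token embeddings (assumed data, not produced by a real encoder).
state_goal = [[1.0, 0.0], [0.0, 1.0]]
actions = {
    "board": [[1.0, 0.0], [0.0, 1.0]],
    "sail": [[0.5, 0.5]],
}

ranked = rank_actions(state_goal, actions)
print(ranked)
```

Because each token is embedded separately and only the final max-and-sum step touches both sides, only the state-and-goals encoding has to be recomputed at each search iteration.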
