Unifying Inference-Time Planning Language Generation

Prabhu Prakash Kagitha, Bo Sun, Ishan Desai, Andrew Zhu, Cassie Huang, Manling Li, Ziyang Li, Li Zhang

TL;DR

This work tackles the reliability and scalability of LLM-assisted planning by introducing a unifying inference-time framework based on intermediate representations (IR). By systematically evaluating a family of pipelines that span from NL-like representations to high-resource IRs such as PyPDDL, and incorporating solver feedback, the authors show that deeper IR pipelines with revision mechanisms consistently yield higher plan accuracy, especially for open-source LLMs with around 32B parameters. A key contribution is demonstrating the effectiveness of syntax-aligned, high-resource IRs (e.g., Python/PyPDDL wrappers) and solver feedback in bridging natural language and formal planning languages, thereby improving robustness as problem complexity grows. The findings offer practical recipes for building strong LLM-as-formalizer pipelines and point toward future work extending the approach to other planning languages and non-planning DSLs.

Abstract

A line of work in planning uses LLMs not to generate a plan, but to generate a formal representation in some planning language, which can be input into a symbolic solver to deterministically find a plan. While this approach shows improved trust and promising performance, dozens of recent publications have proposed scattered methods on a variety of benchmarks under different experimental settings. We attempt to unify the inference-time LLM-as-formalizer methodology for classical planning by proposing a framework based on intermediate representations. We thus systematically evaluate more than a dozen pipelines that subsume most existing work, while proposing novel ones that involve syntactically similar but high-resource intermediate languages (such as a Python wrapper of PDDL). We provide recipes for planning language generation pipelines, draw a series of conclusions showing the efficacy of their various components, and provide evidence of their robustness against problem complexity.
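The pipeline shape described in the abstract (natural language, then one or more intermediate representations, then PDDL, then a solver, with optional revision from solver feedback) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `run_pipeline`, the `stub_*` functions, and the parenthesis-counting "solver" are hypothetical stand-ins, not the authors' code, prompts, or any real planner API. In practice each stage would be an LLM call and the solver would be an actual PDDL planner.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a stage maps one textual representation to the next.
Stage = Callable[[str], str]
# A solver returns (plan, error); exactly one of the two is None.
Solver = Callable[[str], Tuple[Optional[List[str]], Optional[str]]]
# A reviser takes the current PDDL and the solver error, and returns new PDDL.
Reviser = Callable[[str, str], str]


def run_pipeline(
    description: str,
    stages: List[Stage],
    solver: Solver,
    reviser: Reviser,
    max_revisions: int = 2,
) -> Tuple[Optional[List[str]], int]:
    """Chain the stages, call the solver, and revise on failure."""
    text = description
    for stage in stages:  # e.g. NL -> IR -> PDDL
        text = stage(text)
    plan, error = solver(text)
    attempts = 0
    while plan is None and attempts < max_revisions:
        text = reviser(text, error or "")  # feed solver feedback back in
        plan, error = solver(text)
        attempts += 1
    return plan, attempts


# --- Toy stand-ins (assumptions; a real system would call an LLM and a planner) ---
def stub_to_ir(description: str) -> str:
    # Pretend the model wrote a Python-like intermediate representation.
    return f"# IR for: {description}\non('a', 'b')"


def stub_to_pddl(ir: str) -> str:
    # Deliberately omits one closing parenthesis to trigger a revision.
    return "(define (problem toy) (:goal (on a b))"


def stub_solver(pddl: str) -> Tuple[Optional[List[str]], Optional[str]]:
    # Only checks parenthesis balance; a real solver would search for a plan.
    depth = pddl.count("(") - pddl.count(")")
    if depth != 0:
        return None, f"syntax error: parentheses unbalanced by {depth}"
    return ["(stack a b)"], None


def stub_reviser(pddl: str, error: str) -> str:
    # Appends whatever closing parentheses are missing.
    return pddl + ")" * (pddl.count("(") - pddl.count(")"))


plan, attempts = run_pipeline(
    "Stack block a on block b.",
    [stub_to_ir, stub_to_pddl],
    stub_solver,
    stub_reviser,
)
print(plan, attempts)
```

Representing the stages as a list keeps the depth of the pipeline a configuration choice: passing a single stage corresponds to generating PDDL directly, while adding stages inserts intermediate representations before the final PDDL.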

Paper Structure

This paper contains 37 sections, 54 figures, 1 table.

Figures (54)

  • Figure 1: An illustration of using an LLM as a planner or as a formalizer in classical planning. While $\textit{LLM-as-planner}$ generates a plan directly, $\textit{LLM-as-formalizer}$ generates a PDDL domain file and problem file, which are passed to a solver to find a plan. The plan is evaluated against ground-truth PDDL that simulates the environment.
  • Figure 2: Syntactic accuracy and plan accuracy of various methods grouped by levels and LLMs on BlocksWorld and Logistics. Error bars show the standard deviation over 3 runs.
  • Figure 3: Qualitative examples of the same problem in BlocksWorld, juxtaposing the different IRs generated (NL, PyPDDL, PDDL) before generating the final PDDL (Level 2). Also included are a revised PDDL based on solver feedback and a directly generated plan (Level 0).
  • Figure 4: The performance of two LLMs and four methods including usage as planners and as formalizers on BlocksWorld problems with an increasing entity space.
  • Figure 5: Syntactic accuracy and plan accuracy of various methods grouped by levels and LLMs on Sokoban and CoinCollector. Error bars show the standard deviation over 3 runs.
  • ...and 49 more figures