Unifying Inference-Time Planning Language Generation
Prabhu Prakash Kagitha, Bo Sun, Ishan Desai, Andrew Zhu, Cassie Huang, Manling Li, Ziyang Li, Li Zhang
TL;DR
This work tackles the reliability and scalability of LLM-assisted planning by introducing a unifying inference-time framework based on intermediate representations (IRs). By systematically evaluating a family of pipelines that range from NL-like representations to high-resource IRs such as PyPDDL, and by incorporating solver feedback, the authors show that deeper IR pipelines with revision mechanisms consistently yield higher plan accuracy, especially for open-source LLMs with around 32B parameters. A key contribution is demonstrating that syntax-aligned, high-resource IRs (e.g., Python/PyPDDL wrappers) and solver feedback help bridge natural language and formal planning languages, improving robustness as problem complexity grows. The findings offer practical recipes for building strong LLM-as-formalizer pipelines and point toward future work extending the approach to other planning languages and non-planning DSLs.
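As a rough illustration of the revision mechanism mentioned above, the following minimal sketch shows a formalize-solve-revise loop. Every name in it (`formalize_with_revision`, `toy_formalizer`, `toy_solver`, `max_rounds`) is hypothetical and not taken from the paper; the LLM and the symbolic solver are replaced by toy stand-ins so the control flow can be run as-is.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only.
# A formalizer maps (description, previous solver feedback) to formal text.
Formalizer = Callable[[str, Optional[str]], str]
# A solver maps formal text to (plan, error); exactly one of them is None.
Solver = Callable[[str], Tuple[Optional[List[str]], Optional[str]]]


def formalize_with_revision(
    description: str,
    formalizer: Formalizer,
    solver: Solver,
    max_rounds: int = 3,
) -> Optional[List[str]]:
    """Regenerate the formal representation until the solver returns a plan."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        formal = formalizer(description, feedback)
        plan, error = solver(formal)
        if plan is not None:
            return plan
        feedback = error  # pass the solver's error message to the next attempt
    return None


def toy_formalizer(description: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call: the first attempt has a syntax error."""
    if feedback is None:
        return "(define (problem p) (:goal (on a b))"  # missing a closing paren
    return "(define (problem p) (:goal (on a b)))"


def toy_solver(formal: str) -> Tuple[Optional[List[str]], Optional[str]]:
    """Stand-in for a planner: only checks that parentheses are balanced."""
    if formal.count("(") != formal.count(")"):
        return None, "syntax error: unbalanced parentheses"
    return ["(stack a b)"], None


plan = formalize_with_revision("Put block a on block b.", toy_formalizer, toy_solver)
print(plan)
```

In a real pipeline the two stand-ins would be replaced by an LLM call and a PDDL planner; the loop itself only assumes that the solver reports either a plan or an error message.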
Abstract
A line of work in planning uses LLMs not to generate a plan, but to generate a formal representation in some planning language, which can then be passed to a symbolic solver that deterministically finds a plan. While this approach has shown improved trustworthiness and promising performance, dozens of recent publications have proposed scattered methods on a variety of benchmarks under different experimental settings. We attempt to unify the inference-time LLM-as-formalizer methodology for classical planning by proposing a framework based on intermediate representations. Using this framework, we systematically evaluate more than a dozen pipelines that subsume most existing work, and propose novel ones that involve syntactically similar but high-resource intermediate languages (such as a Python wrapper of PDDL). We provide recipes for planning language generation pipelines, draw a series of conclusions on the efficacy of their various components, and present evidence of their robustness against problem complexity.
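To make the idea of a "Python wrapper of PDDL" concrete, here is a minimal sketch of what such an intermediate language could look like. It is an assumption for illustration only: the `Problem` class, its fields, and `to_pddl` are hypothetical and do not reproduce the wrapper used in the paper. The point is that an LLM would write ordinary Python, and a deterministic renderer would produce the PDDL text for the solver.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A ground atom such as ("on", "a", "b"); hypothetical representation.
Atom = Tuple[str, ...]


def render_atom(atom: Atom) -> str:
    """Render ("on", "a", "b") as the PDDL atom "(on a b)"."""
    return "(" + " ".join(atom) + ")"


@dataclass
class Problem:
    """Hypothetical Python-side description of a PDDL problem file."""
    name: str
    domain: str
    objects: List[str] = field(default_factory=list)
    init: List[Atom] = field(default_factory=list)
    goal: List[Atom] = field(default_factory=list)

    def to_pddl(self) -> str:
        """Deterministically render this problem as PDDL text."""
        init = " ".join(render_atom(a) for a in self.init)
        goal = " ".join(render_atom(a) for a in self.goal)
        return (
            f"(define (problem {self.name})\n"
            f"  (:domain {self.domain})\n"
            f"  (:objects {' '.join(self.objects)})\n"
            f"  (:init {init})\n"
            f"  (:goal (and {goal})))"
        )


# Example: a two-block Blocksworld instance written as plain Python.
problem = Problem(
    name="stack-two",
    domain="blocksworld",
    objects=["a", "b"],
    init=[("ontable", "a"), ("ontable", "b"),
          ("clear", "a"), ("clear", "b"), ("handempty",)],
    goal=[("on", "a", "b")],
)
pddl = problem.to_pddl()
print(pddl)
```

Because the renderer, not the model, emits the parentheses and section keywords, a whole class of PDDL syntax errors cannot occur in this sketch; whether that accounts for the gains reported in the paper is not something this example can show.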
