Large Language Models as Planning Domain Generators

James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, Shirin Sohrabi

TL;DR

This work investigates whether large language models (LLMs) can automatically generate executable planning domain models in PDDL from natural language descriptions. It introduces an action-by-action prompting approach evaluated under three classes of domain description, along with two automated domain-quality metrics, Action Reconstruction Error (ARE) and a plan-equivalence heuristic, that compare generated domains against ground-truth domains. Across 9 domains and 7 LLMs, larger models such as LLaMA-2-70b produce syntactically valid PDDL and a nontrivial fraction of heuristically equivalent domains, indicating moderate capability for automated domain reconstruction. The study contributes an automated evaluation framework and points toward improved automated domain construction and broader adoption of symbolic planning through LLM assistance.

Abstract

Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate whether large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models, across 9 different planning domains and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at https://github.com/IBM/NL2PDDL.
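
The idea of comparing the sets of plans for domain instances can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes a toy ground STRIPS-style representation, and the names `Action`, `plan_is_valid`, and `cross_validate` are hypothetical. It checks whether plans found under one domain are also valid under the other, in both directions, for a single problem instance.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set


class Action(NamedTuple):
    """A ground STRIPS-style action (illustrative, not the paper's data model)."""
    pre: FrozenSet[str]     # facts that must hold before the action applies
    add: FrozenSet[str]     # facts the action makes true
    delete: FrozenSet[str]  # facts the action makes false


Domain = Dict[str, Action]


def plan_is_valid(domain: Domain, init: Set[str], goal: Set[str],
                  plan: List[str]) -> bool:
    """Simulate the plan from init and report whether it reaches the goal."""
    state = set(init)
    for name in plan:
        action = domain.get(name)
        # Fail on an unknown action or an unmet precondition.
        if action is None or not action.pre <= state:
            return False
        state = (state - action.delete) | action.add
    return goal <= state


def cross_validate(reference: Domain, generated: Domain,
                   init: Set[str], goal: Set[str],
                   ref_plans: List[List[str]],
                   gen_plans: List[List[str]]) -> bool:
    """True if every plan from each domain is also valid in the other."""
    return (all(plan_is_valid(generated, init, goal, p) for p in ref_plans)
            and all(plan_is_valid(reference, init, goal, p) for p in gen_plans))


# Hypothetical example: the generated domain drops one precondition of "pick".
reference: Domain = {
    "pick": Action(frozenset({"hand-empty", "on-table"}),
                   frozenset({"holding"}),
                   frozenset({"hand-empty", "on-table"})),
}
generated: Domain = {
    "pick": Action(frozenset({"on-table"}),
                   frozenset({"holding"}),
                   frozenset({"hand-empty", "on-table"})),
}

# The hand is not empty here, so only the generated domain admits ["pick"].
mismatch_detected = not cross_validate(
    reference, generated,
    init={"on-table"}, goal={"holding"},
    ref_plans=[], gen_plans=[["pick"]],
)
```

In this sketch a check of this kind can only reveal differences on the instances and plans it is given, which is why it serves as a heuristic rather than a proof of equivalence.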

Paper Structure

This paper contains 25 sections, 1 equation, 17 figures, and 1 table.

Figures (17)

  • Figure 1: A high-level overview of our proposed task.
  • Figure 2: (Top) Characterizing LLM outputs in terms of core result classes. (Bottom) Breakdown of Diff domain subclasses.
  • Figure 3: Overview of LLaMA result class percentages with respect to model size. Contains both chat and base models.
  • Figure 4: Breakdown of LLMs over top level result classes vs different description classes.
  • Figure 5: Action Reconstruction Error (ARE) distribution with respect to reconstruction class over LLMs.
  • ...and 12 more figures