An Extensive Evaluation of PDDL Capabilities in off-the-shelf LLMs

Kaustubh Vyas, Damien Graux, Sébastien Montella, Pavlos Vougiouklis, Ruofei Lai, Keshuang Li, Yang Ren, Jeff Z. Pan

TL;DR

This work examines whether off-the-shelf LLMs can handle Planning Domain Definition Language (PDDL) tasks in zero-shot settings. It evaluates 20 models from seven families on three tasks (action generation, problem generation, and plan generation) using Planetarium and Oswald-derived data, with metrics for syntax, solvability, and semantic equivalence, plus a formal similarity measure. Some models (e.g., GPT-4o, Qwen2.5-72B-Instruct, Mistral-Large2) produce near-parsable actions and related content, but overall proficiency in generating complete PDDL problems and valid plans remains limited, and long-horizon planning is particularly challenging; the LLaMA family often tends toward unparsable syntax. The study highlights the potential of LLMs as co-pilots that assist PDDL drafting, while emphasizing the need to couple them with traditional planners and decomposition techniques. These findings are intended to guide future AI-driven planning research and tooling.

Abstract

Large language models (LLMs) have recently exhibited proficiency in code generation and chain-of-thought reasoning, laying the groundwork for tackling automatic formal planning tasks. This study evaluates the potential of LLMs to understand and generate Planning Domain Definition Language (PDDL), an essential representation in artificial intelligence planning. We conduct an extensive analysis across 20 distinct models spanning 7 major LLM families, both commercial and open-source. Our comprehensive evaluation sheds light on the zero-shot capabilities of LLMs in parsing, generating, and reasoning with PDDL. Our findings indicate that while some models handle PDDL effectively, others show limitations in more complex scenarios requiring nuanced planning knowledge. These results highlight the promise and current limitations of LLMs in formal planning tasks, offering insights into their application and guiding future efforts in AI-driven planning paradigms.

Paper Structure

This paper contains 11 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: LLM performance across the three benchmarks (higher is better).
  • Figure 2: Performance of LLMs as co-pilots, measured as the closeness (%) of generations to the "gold" reference.