A Systematic Study of Large Language Models for Task and Motion Planning With PDDLStream

Jorge Mendez-Mendez

TL;DR

This work probes whether large language models can robustly plan long-horizon robotics tasks expressed as PDDLStream problems. It introduces 16 LLM-based planners that substitute components of two base TAMP algorithms (Adaptive and Bilevel) and evaluates zero-shot performance on 4,950 problems across three domains using Gemini 2.5 Flash. The findings show that while LLM-based planners can solve some problems, they generally lag behind engineered TAMP planners in both success rate and planning time, with integrated prompting and geometry-aware sampling often offering little or negative benefit. The study highlights that using LLMs for candidate generation while delegating geometric reasoning to the TAMP core yields the most favorable trade-off, and it points to token limits and time constraints as key bottlenecks for practical LLM-assisted planning.

Abstract

Using large language models (LLMs) to solve complex robotics problems requires understanding their planning capabilities. Yet while we know that LLMs can plan on some problems, the extent to which these planning capabilities cover the space of robotics tasks is unclear. One promising direction is to integrate the semantic knowledge of LLMs with the formal reasoning of task and motion planning (TAMP). However, the myriad of choices for how to integrate LLMs within TAMP complicates the design of such systems. We develop 16 algorithms that use Gemini 2.5 Flash to substitute key TAMP components. Our zero-shot experiments across 4,950 problems and three domains reveal that the Gemini-based planners exhibit lower success rates and higher planning times than their engineered counterparts. We show that providing geometric details increases the number of task-planning errors compared to pure PDDL descriptions, and that (faster) non-reasoning LLM variants outperform (slower) reasoning variants in most cases, since the TAMP system can direct the LLM to correct its mistakes.

Paper Structure

This paper contains 32 sections and 8 figures.

Figures (8)

  • Figure 1: Rovers domain [garrett2020pddlstream]. Two Turtlebots must acquire one rock sample (black patch) and one soil sample (brown patch), photograph $k=4$ objectives (blue boxes), and send results to the Husky. Obstacles limit visibility and traversability.
  • Figure 2: Summary of the 16 LLM-based planners.
  • Figure 3: Left: Blocks domain [garrett2020pddlstream]. PR2 must place one of the blue boxes on the green region, but the nearest box is blocked. Right: Packing domain with $k=5$ objects [garrett2020pddlstream]. PR2 must place all boxes on the green region, avoiding collisions.
  • Figure 4: Success rate across 50 problems per algorithm and domain (compact letter display atop bars). When considering all domains, non-LLM methods achieve higher success rates than the LLM-based variants. Integrated approaches in general achieve the lowest success rates, and all LLM-based methods solve only a small portion of the problems in the Rovers domain.
  • Figure 5: Fraction of 50 problems per domain that failed for each reason per algorithm. Most failures occur due to timeouts, but LLMs also often give up and assert that a (solvable) problem is not solvable. Some algorithms often hit the input token limit of $10^6$ TPM imposed by the Gemini API; this is especially common in direct (non-thinking) LLM variants. (LLM usage legend: Pd = pddl, Ps = poses, PP = pddl + poses, In = integrated.)
  • ...and 3 more figures