LLM-GROP: Visually Grounded Robot Task and Motion Planning with Large Language Models

Xiaohan Zhang, Yan Ding, Yohei Hayamizu, Zainab Altaweel, Yifeng Zhu, Yuke Zhu, Peter Stone, Chris Paxton, Shiqi Zhang

TL;DR

This work presents LLM-GROP, a framework that unifies large language model–driven semantic goal generation with visually grounded task and motion planning for mobile manipulation under underspecified instructions. By extracting symbolic and geometric spatial relations via prompts and validating them with a consistency-checking ASP system, the approach yields semantically valid tabletop configurations. A vision-based feasibility evaluator, trained in simulation, guides the GROP algorithm to choose optimal base positions and generate feasible motion plans, balancing plan feasibility and efficiency. The method is validated in real-robot and simulated experiments, showing higher user-rated quality and competitive efficiency compared with baselines, and demonstrating robustness across several LLM backbones. The results underscore the potential of combining foundation models with perception to address open-world, long-horizon MoMa tasks, while also highlighting practical considerations such as LLM cost and the need for open-world extensions.

Abstract

Task planning and motion planning are two of the most important problems in robotics, where task planning methods help robots achieve high-level goals and motion planning methods maintain low-level feasibility. Task and motion planning (TAMP) methods interleave the two processes of task planning and motion planning to ensure goal achievement and motion feasibility. Within the TAMP context, we are concerned with the mobile manipulation (MoMa) of multiple objects, where it is necessary to interleave actions for navigation and manipulation. In particular, we aim to compute where and how each object should be placed given underspecified goals, such as "set up dinner table with a fork, knife and plate." We leverage the rich common sense knowledge from large language models (LLMs), e.g., about how tableware is organized, to facilitate both task-level and motion-level planning. In addition, we use computer vision methods to learn a strategy for selecting base positions to facilitate MoMa behaviors, where the base position corresponds to the robot's "footprint" and orientation in its operating space. Altogether, this article provides a principled TAMP framework for MoMa tasks that accounts for common sense about object rearrangement and is adaptive to novel situations that include many objects that need to be moved. We performed quantitative experiments in both real-world settings and simulated environments, evaluating the success rate and efficiency in completing long-horizon object rearrangement tasks. While the robot completed 84.4% of the real-world object rearrangement trials, subjective human evaluations indicated that the robot's performance is still lower than that of experienced human waiters.

Paper Structure

This paper contains 14 sections, 6 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: An illustration of our mobile manipulation (MoMa) domain, where a mobile manipulator is tasked with setting a dining table. The robot must arrange several tableware items, including a knife, a fork, a plate, a cup mat, and a mug. These objects are located on other tables, and the environment also includes randomly generated obstacles (e.g., chairs), which are not accounted for in the pre-built map. The robot must compute semantically specified goal configurations of the objects and (task and motion) plans for rearranging the objects on the target table. The computed plan includes both navigation and manipulation behaviors.
  • Figure 2: An overview of the LLM-GROP approach. LLM-GROP takes service requests from humans for setting tables and produces a task-motion plan that the robot can execute. LLM-GROP comprises two key components: the LLM and the Task and Motion Planner. The LLM is responsible for creating both symbolic and geometric spatial relationships between the tableware objects. This provides the necessary context for the robot to understand how the objects should be arranged on the table. The Task and Motion Planner generates the plan for the robot to execute based on the information provided by the LLM. An important component of LLM-GROP is GROP, which takes a top-down view image as input and suggests standing positions to facilitate MoMa behaviors. GROP is trained exclusively using simulation data. In the real world, the robot estimates poses of objects and builds a digital twin for task and motion planning. Details of GROP are shown in Figure 3.
  • Figure 3: An overview of the data collection and training process in GROP. A task corresponds to one "unloading goal" on the table, as well as a configuration of obstacles (chairs in our case). Given a task, every pixel is considered a navigation goal -- the robot attempts to navigate there, and unload an object from there. This navigation-manipulation process is referred to as a trial. The robot performs multiple trials for each navigation goal, which yields a feasibility value for that particular location. The feasibility values together form one heatmap for each task. In our dataset, each instance is a top-down view image, whose label is the corresponding heatmap. The "Dataset" box shows a few "combined heatmaps" where heatmaps are overlaid onto the corresponding images. Training with the dataset generates an FCN that is used for two purposes: 1) evaluating the feasibility of task-level actions, and 2) selecting motion-level navigation goals. Finally, GROP incorporates both efficiency (measured by action costs) and feasibility to compute task-motion plans for a mobile manipulator.
  • Figure 4: An illustrative example of LLM-GROP showing the robot navigation trajectories (dashed lines) as applied to the task of "set the table with a bread plate, a fork, a knife, and a bread." LLM-GROP is able to adapt to complex environments, using common sense extracted from an LLM to generate efficient (i.e., minimizing the overall navigation cost) and feasible (i.e., selecting an available side of the table to unload) pick-and-place motion plans for the robot.
  • Figure 5: Example outcomes of the robot completing object rearrangement tasks. The "easy" environment did not include any obstacles, while the other environments included a chair on one side of the table. Note that the "top" and "bottom" labels shown in the columns were with respect to the robot's view. There were three tasks (IDs 4, 7 and 9; see the paper's task table) used for the real-robot experiments covering different numbers of objects being rearranged. The robot dynamically computed the goal configurations of those objects and (task and motion) plans for realizing those configurations.
  • ...and 4 more figures
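
The Figure 3 caption describes two steps that a short example may make more concrete: turning repeated navigate-and-unload trials into a per-pixel feasibility heatmap, and then weighing feasibility against action cost when choosing a base position. The Python sketch below is illustrative only. The function names, the straight-line distance used as a navigation cost, and the utility `feasibility * success_reward - cost` are all assumptions made for this example; the source does not state how GROP combines the two terms, and in the paper the heatmap is predicted by a trained FCN rather than looked up from raw trial counts.

```python
import numpy as np


def feasibility_heatmap(successes, trials):
    """Per-pixel success rate of navigate-and-unload trials.

    Pixels with zero trials (e.g., inside an obstacle) get feasibility 0.
    """
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    return np.divide(
        successes, trials, out=np.zeros_like(successes), where=trials > 0
    )


def select_base_position(heatmap, robot_rc, success_reward=10.0, cost_per_pixel=1.0):
    """Pick the pixel with the best feasibility/cost trade-off.

    ASSUMPTION: utility = feasibility * success_reward - navigation cost,
    with navigation cost approximated by straight-line pixel distance.
    This is a hypothetical stand-in, not the paper's objective.
    """
    rows, cols = np.indices(heatmap.shape)
    distance = np.hypot(rows - robot_rc[0], cols - robot_rc[1])
    utility = heatmap * success_reward - cost_per_pixel * distance
    best = np.unravel_index(np.argmax(utility), heatmap.shape)
    return (int(best[0]), int(best[1])), float(utility[best])


# Toy 2x3 "top-down view": successes out of 5 trials per pixel.
successes = [[0, 1, 5], [0, 4, 5]]
trials = [[0, 5, 5], [5, 5, 5]]
heatmap = feasibility_heatmap(successes, trials)

# Cheap navigation favors the most feasible pixel, even though it is farther.
print(select_base_position(heatmap, robot_rc=(1, 0)))
# Costly navigation shifts the choice to a closer, slightly less feasible pixel.
print(select_base_position(heatmap, robot_rc=(1, 0), cost_per_pixel=3.0))
```

With the toy numbers above, raising `cost_per_pixel` moves the selected position from pixel (1, 2) to the nearer pixel (1, 1), which is the kind of feasibility-versus-efficiency trade-off the caption refers to.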