PLATO: Planning with LLMs and Affordances for Tool Manipulation

Arvind Car, Sai Sravan Yarlagadda, Alison Bartsch, Abraham George, Amir Barati Farimani

TL;DR

PLATO is a system of specialized large language model agents that process natural language instructions, understand the environment, predict tool affordances, and generate executable actions for robotic systems, without any initial knowledge of the environment.

Abstract

As robotic systems become increasingly integrated into complex real-world environments, there is a growing need for approaches that enable robots to understand and act upon natural language instructions without relying on extensive pre-programmed knowledge of their surroundings. This paper presents PLATO, an innovative system that addresses this challenge by leveraging specialized large language model agents to process natural language inputs, understand the environment, predict tool affordances, and generate executable actions for robotic systems. Unlike traditional systems that depend on hard-coded environmental information, PLATO employs a modular architecture of specialized agents to operate without any initial knowledge of the environment. These agents identify objects and their locations within the scene, generate a comprehensive high-level plan, translate this plan into a series of low-level actions, and verify the completion of each step. The system is particularly tested on challenging tool-use tasks, which involve handling diverse objects and require long-horizon planning. PLATO's design allows it to adapt to dynamic and unstructured settings, significantly enhancing its flexibility and robustness. By evaluating the system across various complex scenarios, we demonstrate its capability to tackle a diverse range of tasks and offer a novel solution to integrate LLMs with robotic platforms, advancing the state-of-the-art in autonomous robotic task execution. For videos and prompt details, please see our project website: https://sites.google.com/andrew.cmu.edu/plato
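The agent sequence described in the abstract (locate objects, produce a high-level plan, translate each step into low-level actions, verify each step) can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the `Agents` container, every callable name, the retry policy, and the demo scene are hypothetical and stand in for LLM and robot calls that the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float, float]
Scene = Dict[str, Position]


@dataclass
class Agents:
    """Hypothetical bundle of agent callables; none of these names come from the paper."""
    locate_objects: Callable[[str], Scene]          # scene understanding
    plan_overall: Callable[[str, Scene], List[str]]  # high-level plan
    plan_step: Callable[[str, Scene], List[str]]     # low-level actions for one step
    execute: Callable[[str], None]                   # send one action to the robot
    verify_step: Callable[[str], bool]               # check that a step completed


def run_task(instruction: str, agents: Agents, max_retries: int = 2) -> List[str]:
    """Run plan -> act -> verify for each high-level step; return the executed actions."""
    scene = agents.locate_objects(instruction)
    executed: List[str] = []
    for step in agents.plan_overall(instruction, scene):
        # Retrying an unverified step is an assumption made for this sketch.
        for _attempt in range(max_retries + 1):
            for action in agents.plan_step(step, scene):
                agents.execute(action)
                executed.append(action)
            if agents.verify_step(step):
                break
        else:
            raise RuntimeError(f"step not verified: {step}")
    return executed


def demo_agents() -> Agents:
    """Stub agents with made-up objects and plans, used only to exercise the loop."""
    scene = {"spatula": (0.4, 0.1, 0.02), "pan": (0.5, -0.2, 0.03)}
    return Agents(
        locate_objects=lambda instruction: scene,
        plan_overall=lambda instruction, s: ["grasp spatula", "flip food in pan"],
        plan_step=lambda step, s: [f"move_to({step.split()[-1]})", "close_gripper()"],
        execute=lambda action: None,
        verify_step=lambda step: True,
    )


if __name__ == "__main__":
    print(run_task("flip the food", demo_agents()))
```

Passing the agents in as callables keeps the loop independent of any particular LLM or robot interface, so each one can be replaced by a stub when testing the control flow.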

Paper Structure (20 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: PLATO Overview. Our proposed system, PLATO, takes in an environmental observation, predicts the tool's affordance given the prompted task, and prompts a modular LLM framework to generate an action plan.
  • Figure 2: Method Pipeline. PLATO takes in a user prompt and multi-view images of the scene. These are passed to the Scene Comprehension LLM, which lists the relevant objects present in the scene and classifies each as a tool or non-tool. This list of objects is passed to the SAM vision module, which segments the point cloud of each object to obtain its centroid and dimensions. Meanwhile, the Overall Planner LLM outputs a high-level sequence of commands, which are iteratively converted to low-level actions by the Step Planner LLM. These actions are sequentially executed by the robot.
  • Figure 3: Task-Oriented Grasping Module. The process starts by generating a mask of the target object using an overhead image captured by a camera mounted on the end-effector. Potential grasps are then generated and refined based on the tool’s graspable region. The task and tool are passed to an LLM, which maps the query tool to the most similar one in the affordance model's database. The model uses this mapping to refine grasp selection, and if the mapping fails, the mask of the entire tool is used to guide the grasp choice.
  • Figure 4: Hardware Setup. A visualization of the 7-DoF Franka robot in the table-top manipulation workspace, with four Intel RealSense D415 cameras observing the scene and a single wrist-mounted Intel RealSense D415 camera.
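
The fallback described in the Figure 3 caption (use the affordance model's graspable region when the LLM maps the query tool to a known tool, otherwise use the mask of the entire tool) can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the boolean-grid mask representation, the pixel-based grasp candidates, and the example tool names are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int]   # (row, col) of a grasp candidate in the overhead image
Mask = List[List[bool]]   # True where the pixel belongs to the region


def choose_guidance_mask(
    query_tool: str,
    task: str,
    known_tools: Sequence[str],
    map_tool: Callable[[str, str, Sequence[str]], Optional[str]],
    predict_region: Callable[[str, Mask], Mask],
    tool_mask: Mask,
) -> Mask:
    """Return the graspable-region mask, or the whole-tool mask if mapping fails."""
    matched = map_tool(query_tool, task, known_tools)  # stands in for the LLM call
    if matched is None or matched not in known_tools:
        return tool_mask  # fallback: mask of the entire tool
    return predict_region(matched, tool_mask)  # stands in for the affordance model


def filter_grasps(candidates: List[Pixel], mask: Mask) -> List[Pixel]:
    """Keep only grasp candidates whose pixel lies inside the guidance mask."""
    return [(r, c) for r, c in candidates if mask[r][c]]


# Made-up one-row image: the tool covers columns 0-3, its handle covers columns 0-1.
TOOL_MASK: Mask = [[True, True, True, True, False]]
HANDLE_MASK: Mask = [[True, True, False, False, False]]
CANDIDATES: List[Pixel] = [(0, 0), (0, 2), (0, 4)]
KNOWN_TOOLS = ["hammer"]


def stub_map_tool(tool: str, task: str, known: Sequence[str]) -> Optional[str]:
    """Stub mapping: only a 'mallet' is mapped to a known tool."""
    return "hammer" if tool == "mallet" else None


def stub_predict_region(matched: str, tool_mask: Mask) -> Mask:
    """Stub affordance model: always returns the handle region."""
    return HANDLE_MASK


def grasps_for(tool: str) -> List[Pixel]:
    mask = choose_guidance_mask(
        tool, "hit the peg", KNOWN_TOOLS, stub_map_tool, stub_predict_region, TOOL_MASK
    )
    return filter_grasps(CANDIDATES, mask)
```

With these stubs, a tool that maps to a known entry keeps only candidates on the handle, while an unmapped tool keeps every candidate that lies anywhere on the tool.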