AutoGPT+P: Affordance-based Task Planning with Large Language Models

Timo Birr, Christoph Pohl, Abdelrahman Younes, Tamim Asfour

TL;DR

AutoGPT+P tackles open-world robotic planning by introducing an affordance-based scene representation that connects perception to symbolic planning. It combines object detection with a ChatGPT-generated Object Affordance Mapping to derive a dynamic PDDL domain, enabling the LLM to produce and refine goal states, which a classical planner then converts into executable plans. A memory-driven feedback loop enables tool selection, exploration, and substitute reasoning, and self-correction of semantic and syntactic errors improves success from 78% to 98% in some settings. Evaluations in simulation and on ARMAR-6/ARMAR-DE show strong performance in handling missing objects and producing robust plans, though future work is needed to incorporate probabilistic reasoning and richer execution feedback. Overall, AutoGPT+P advances open-world robotic planning by tightly integrating perception, affordances, and planning under uncertainty.

Abstract

Recent advances in task planning leverage Large Language Models (LLMs) to improve generalizability by combining such models with classical planning algorithms to address their inherent limitations in reasoning capabilities. However, these approaches face the challenge of dynamically capturing the initial state of the task planning problem. To alleviate this issue, we propose AutoGPT+P, a system that combines an affordance-based scene representation with a planning system. Affordances encompass the action possibilities of an agent on the environment and the objects present in it. Thus, deriving the planning domain from an affordance-based scene representation allows symbolic planning with arbitrary objects. AutoGPT+P leverages this representation to derive and execute a plan for a task specified by the user in natural language. In addition to solving planning tasks under a closed-world assumption, AutoGPT+P can also handle planning with incomplete information, e.g., tasks with missing objects, by exploring the scene, suggesting alternatives, or providing a partial plan. The affordance-based scene representation combines object detection with an Object Affordance Mapping that is automatically generated using ChatGPT. The core planning tool extends existing work by automatically correcting semantic and syntactic errors. Our approach achieves a success rate of 98% on the SayCan instruction set, surpassing the 81% success rate of SayCan, the current state-of-the-art LLM-based planning method. Furthermore, we evaluated our approach on our newly created dataset of 150 scenarios covering a wide range of complex tasks with missing objects, achieving a success rate of 79%. The dataset and the code are publicly available at https://git.h2t.iar.kit.edu/birr/autogpt-p-standalone.
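
As a rough illustration of how an affordance-based scene representation can feed a symbolic planner, the sketch below turns a list of detected object classes into a PDDL problem whose initial state lists each object's affordances. It is a simplification under stated assumptions, not the authors' implementation: the mapping entries, the predicate names (`graspable`, `can-contain`, `on`, and so on), the domain name `affordance-domain`, and the function `scene_to_pddl_problem` are all hypothetical, and the sketch only emits a problem file rather than deriving a full domain.

```python
from typing import Dict, List, Set

# Hypothetical object-affordance mapping (OAM). In the paper this mapping is
# generated with ChatGPT; the entries below are made up for illustration.
OAM: Dict[str, Set[str]] = {
    "cup": {"graspable", "can-contain"},
    "knife": {"graspable", "can-cut"},
    "table": {"can-support"},
}


def scene_to_pddl_problem(detected: List[str], goal: str,
                          oam: Dict[str, Set[str]] = OAM) -> str:
    """Build a PDDL problem string from detected object classes.

    Each detection becomes a uniquely named object (cup0, cup1, ...), and
    every affordance of its class becomes a unary fact in the initial state.
    Classes missing from the mapping contribute an object but no facts.
    """
    counts: Dict[str, int] = {}
    objects: List[str] = []
    init: List[str] = []
    for cls in detected:
        index = counts.get(cls, 0)
        counts[cls] = index + 1
        name = f"{cls}{index}"
        objects.append(name)
        # Sort affordances so the generated text is deterministic.
        for affordance in sorted(oam.get(cls, set())):
            init.append(f"({affordance} {name})")
    return (
        "(define (problem scene)\n"
        "  (:domain affordance-domain)\n"  # assumed domain name
        f"  (:objects {' '.join(objects)})\n"
        f"  (:init {' '.join(init)})\n"
        f"  (:goal {goal}))"
    )


# Example scene: two cups, a knife, and a table; the goal uses an assumed
# binary predicate `on`.
problem = scene_to_pddl_problem(["cup", "cup", "knife", "table"],
                                "(on cup0 table0)")
print(problem)
```

Because the facts are derived from the mapping rather than hard-coded per object, a newly detected object class only needs an entry in the mapping to become usable by the planner, which is the property the abstract refers to as planning with arbitrary objects.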

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 6 tables, and 4 algorithms.

Figures (5)

  • Figure 1: ARMAR-DE solves the user task given in natural language by detecting the objects within the scene, reasoning about their affordances, planning how to solve the task (including asking for help), and finally executing the plan.
  • Figure 2: A taxonomy of LLMs in planning tasks with the related work from this section referenced.
  • Figure 3: Overview of Object Affordance Detection (OAD). It uses an RGB image of a scene to detect the objects present in the scene. In the second step, the Object Affordance Mapping (OAM) maps the objects to their corresponding affordances.
  • Figure 4: Overview of the AutoGPT+P feedback loop. Green boxes symbolize inputs and outputs, while blue boxes symbolize discrete steps of the process. The tool selection process chooses one of the tools in the yellow Tools box. The numbers on top of the boxes show in which section the aspect of the work is explained.
  • Figure 5: Overview of the Planning Tool. Rounded boxes represent the input and the output of the components that are represented as rectangles.
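
The abstract states that the planning tool automatically corrects semantic and syntactic errors in LLM-generated goals. One way such a loop could look is sketched below, under loud assumptions: the checks are deliberately crude (balanced parentheses for syntax, known predicate and object names for semantics), `propose` is a hypothetical stand-in for the LLM call, and none of the names come from the paper or its code.

```python
from typing import Callable, List, Optional, Set

LOGICAL_KEYWORDS = {"and", "or", "not"}


def check_goal(goal: str, known_objects: Set[str],
               known_predicates: Set[str]) -> List[str]:
    """Return error messages for a goal; an empty list means it passed."""
    # Syntactic check: parentheses must be balanced and never close early.
    depth = 0
    for ch in goal:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        return ["syntax: unbalanced parentheses"]

    # Semantic check: every predicate and object must exist in the scene.
    errors: List[str] = []
    tokens = goal.replace("(", " ( ").replace(")", " ) ").split()
    for i, tok in enumerate(tokens):
        if tok in {"(", ")"} or tok in LOGICAL_KEYWORDS:
            continue
        if i > 0 and tokens[i - 1] == "(":
            if tok not in known_predicates:
                errors.append(f"semantic: unknown predicate '{tok}'")
        elif tok not in known_objects:
            errors.append(f"semantic: unknown object '{tok}'")
    return errors


def refine_goal(propose: Callable[[List[str]], str],
                known_objects: Set[str], known_predicates: Set[str],
                max_attempts: int = 3) -> Optional[str]:
    """Ask `propose` for a goal until it passes the checks or attempts run out.

    `propose` receives the error messages from the previous attempt (empty on
    the first call), mimicking feedback passed back to an LLM.
    """
    feedback: List[str] = []
    for _ in range(max_attempts):
        goal = propose(feedback)
        feedback = check_goal(goal, known_objects, known_predicates)
        if not feedback:
            return goal
    return None
```

In this sketch, a goal that references an object absent from the scene is rejected with a message naming that object, which the next proposal can use to correct itself. A goal that passes both checks would then be handed to a classical planner.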