Empowering Large Language Models on Robotic Manipulation with Affordance Prompting

Guangran Cheng, Chuheng Zhang, Wenzhe Cai, Li Zhao, Changyin Sun, Jiang Bian

TL;DR

This paper tackles the challenge of grounding large language models (LLMs) for physical robotic manipulation by introducing LLM+A, a training-free framework where an LLM acts as both sub-task planner and motion controller. A vision-language model provides environment observations, while affordance prompting elicits goal-conditioned consequences and object-part affordances to ground planning and control in the real world. Experiments across Language-Table and CLIPORT benchmarks show that affordance prompting significantly improves plan feasibility and control execution, yielding notable performance gains over baselines and strong generalization to new environments and robotic setups. The approach reduces the data burden typical of robotics by leveraging pretrained LLMs and VLMs, with potential for broader application to diverse physical tasks.

Abstract

While large language models (LLMs) succeed at a wide range of language processing tasks, they often fail to generate proper control sequences for interacting with the physical world. We find that the main reason is that LLMs are not grounded in the physical world. Existing LLM-based approaches circumvent this problem by relying on additional pre-defined skills or pre-trained sub-policies, which makes them hard to adapt to new tasks. In contrast, we address this problem directly and explore the possibility of prompting pre-trained LLMs to accomplish a series of robotic manipulation tasks in a training-free paradigm. Accordingly, we propose a framework called LLM+A(ffordance), in which the LLM serves as both the sub-task planner (which generates high-level plans) and the motion controller (which generates low-level control sequences). To ground these plans and control sequences in the physical world, we develop the affordance prompting technique, which stimulates the LLM to 1) predict the consequences of generated plans and 2) generate affordance values for relevant objects. Empirically, we evaluate LLM+A on various language-conditioned robotic manipulation tasks; the results show that our approach substantially improves performance by enhancing the feasibility of generated plans and control sequences, and that it generalizes easily to different environments.
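
The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Plan`, `llm_a_step`, `describe`, `plan`, and `control` are hypothetical, the three callables stand in for the VLM, the high-level LLM, and the low-level LLM, and the stub functions return hard-coded values in place of real model calls. The coordinate convention (x grows to the right) is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Waypoint = Tuple[float, float]  # assumed 2D table coordinates, x grows to the right


@dataclass
class Plan:
    sub_task: str                  # high-level step in natural language
    consequence: str               # predicted effect of executing the step
    affordances: Dict[str, float]  # object part -> usefulness for the goal, in [0, 1]


def llm_a_step(
    instruction: str,
    image: object,
    describe: Callable[[object], str],               # stand-in for the VLM
    plan: Callable[[str, str], Plan],                # stand-in for the high-level LLM
    control: Callable[[str, Plan], List[Waypoint]],  # stand-in for the low-level LLM
) -> Tuple[Plan, List[Waypoint]]:
    """Run one perceive -> plan -> control pass with no training involved."""
    observation = describe(image)
    step = plan(instruction, observation)
    waypoints = control(observation, step)
    return step, waypoints


# Hard-coded stubs (hypothetical) so the sketch runs without any model.
def fake_describe(image: object) -> str:
    return "purple block at (0.5, 0.5); end-effector at (0.2, 0.5)"


def fake_plan(instruction: str, observation: str) -> Plan:
    return Plan(
        sub_task="move to the right side of the purple block, then push left",
        consequence="the purple block slides toward the left edge of the table",
        affordances={"left side": 0.1, "right side": 0.9},
    )


def fake_control(observation: str, step: Plan) -> List[Waypoint]:
    # Approach the part with the highest affordance value, then push through it.
    best_part = max(step.affordances, key=step.affordances.get)
    if best_part == "right side":
        return [(0.6, 0.5), (0.1, 0.5)]
    return [(0.4, 0.5), (0.9, 0.5)]


step, waypoints = llm_a_step(
    "Push the purple block to the left side of the table",
    None,
    fake_describe,
    fake_plan,
    fake_control,
)
```

Passing the three models in as callables keeps the sketch independent of any particular LLM or VLM API; in practice each stub would be replaced by a prompted model call.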

Paper Structure (10 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Consider the task "Push the purple block to the left side of the table". When the control sequences generated by LLMs are not grounded in the physical world, the robot moves to the left side of the block to push it to the left (a) instead of moving to the correct location (b). This is due to the gap between the physical world and the generated language plans. The proposed LLM+A method bridges this gap by prompting LLMs to predict execution consequences and goal-conditioned affordance values (c).
  • Figure 2: Overview of LLM+A. Given language instructions and image observations, LLM+A produces sub-task plans and control sequences for robotic control tasks. LLM+A is composed of a VLM and a hierarchical LLM. The VLM serves as an observation descriptor that provides environment perception to the LLM. The high-level LLM is responsible for sub-task planning and the low-level LLM for motion control. Notably, the affordance values from the high-level LLM are necessary intermediate information for the LLM to understand the effects of potential actions and generate feasible plans grounded in the physical world.
  • Figure 3: Examples of environmental observation and robot trajectories in Block-to-Position (a-d), Block-to-Block (e-h), and Separate (i-l). The gray cylinder indicates the position of the robot end-effector. The blue dots and the green lines represent the waypoints and the planned paths of the control sequences generated by LLM+A, respectively. The red boxes denote the detected bounding boxes from Grounding DINO.
  • Figure 4: Example of environmental observation and affordance prediction from the LLM in the Towers-of-Hanoi task.
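
The pushing example in Figure 1 can be made concrete with a small geometric stand-in for goal-conditioned affordance values. In the paper these values are elicited from the LLM by prompting; the function below is only an assumed hand-written heuristic that scores each side of a block by how well a push from that side moves the block toward the goal. The function name `side_affordances`, the side labels, and the coordinate convention (x grows to the right, y grows upward) are all hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def side_affordances(block_xy: Point, goal_xy: Point) -> Dict[str, float]:
    """Score each block side in [0, 1] for pushing the block toward the goal."""
    dx = goal_xy[0] - block_xy[0]
    dy = goal_xy[1] - block_xy[1]
    norm = math.hypot(dx, dy)
    sides = ("left", "right", "bottom", "top")
    if norm == 0:
        # Block is already at the goal, so no side is useful to push from.
        return {side: 0.0 for side in sides}
    ux, uy = dx / norm, dy / norm  # unit vector of the desired motion

    # Outward normal of each side. Pushing on a side moves the block
    # opposite to that normal, so a good side has its normal pointing
    # away from the goal direction.
    normals = {"left": (-1, 0), "right": (1, 0), "bottom": (0, -1), "top": (0, 1)}
    return {
        side: max(0.0, -(nx * ux + ny * uy))
        for side, (nx, ny) in normals.items()
    }


# Block in the middle of the table, goal at the left edge.
aff = side_affordances((0.5, 0.5), (0.0, 0.5))
best_side = max(aff, key=aff.get)
```

Under these assumptions the right side receives the highest score for a leftward push, which matches the grounded behavior the caption of Figure 1 describes.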