Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, Xuelong Li

TL;DR

The paper tackles generalizable articulated object manipulation with minimal robotic data by introducing a kinematic-aware prompting framework that leverages LLM world knowledge. It builds a Unified Kinematic Knowledge Parser that converts an object's kinematic structure into a unified textual description, and a Kinematic-aware Manipulation Planner that yields precise 3D manipulation waypoints through hierarchical prompting and in-context learning. Key contributions include the parser and the planner, a simulation evaluation across 48 object instances from 16 categories that shows zero-shot generalization to 8 unseen categories, and real-world tests on 7 object categories. The approach demonstrates that grounding LLM reasoning in explicit kinematic representations enables low-level control for complex articulated objects, potentially reducing data requirements and improving practical robotics deployment. Noted limitations include dependence on perception accuracy and the need for better mathematical and spatial reasoning in LLMs; future work points toward integrating vision foundation models to improve real-world performance.

Abstract

Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation. However, due to the prohibitive costs of real-world data collection and precise object simulation, it remains challenging for these works to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of these studies focus on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of an object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting the manipulation of various objects. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thought prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories demonstrate our framework's adaptability in practical scenarios. Code is released at https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main.
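
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `Joint` record and a made-up text layout for the unified description (the paper's actual format is not given here), and it computes waypoints geometrically rather than by querying an LLM: a contact point is rotated about a revolute joint axis using Rodrigues' rotation formula, which is the kind of 3D trajectory the planner is expected to output for a hinged part.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Joint:
    """Hypothetical joint record; field names are assumptions for illustration."""
    name: str
    joint_type: str        # "revolute" or "prismatic"
    axis_origin: Vec3      # a point on the joint axis
    axis_direction: Vec3   # unit vector along the joint axis
    contact_point: Vec3    # where the gripper touches the movable part


def _fmt(v: Vec3) -> str:
    return "[" + ", ".join(f"{c:.2f}" for c in v) + "]"


def describe_object(obj_name: str, joints: list[Joint]) -> str:
    """Serialize joints into one textual description (assumed layout)."""
    lines = [f"object: {obj_name}"]
    for j in joints:
        lines.append(
            f"- joint {j.name}: type={j.joint_type}, "
            f"axis_origin={_fmt(j.axis_origin)}, "
            f"axis_direction={_fmt(j.axis_direction)}, "
            f"contact={_fmt(j.contact_point)}"
        )
    return "\n".join(lines)


def revolute_waypoints(joint: Joint, angle_rad: float, steps: int) -> list[Vec3]:
    """Rotate the contact point about the joint axis (Rodrigues' formula).

    Returns steps + 1 points, starting at the original contact point.
    """
    kx, ky, kz = joint.axis_direction
    ox, oy, oz = joint.axis_origin
    px, py, pz = joint.contact_point
    vx, vy, vz = px - ox, py - oy, pz - oz          # contact relative to axis
    cx, cy, cz = (ky * vz - kz * vy,                # k x v
                  kz * vx - kx * vz,
                  kx * vy - ky * vx)
    dot = kx * vx + ky * vy + kz * vz               # k . v
    points = []
    for i in range(steps + 1):
        t = angle_rad * i / steps
        c, s = math.cos(t), math.sin(t)
        points.append((
            ox + vx * c + cx * s + kx * dot * (1 - c),
            oy + vy * c + cy * s + ky * dot * (1 - c),
            oz + vz * c + cz * s + kz * dot * (1 - c),
        ))
    return points


hinge = Joint("door_hinge", "revolute",
              (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
print(describe_object("cabinet", [hinge]))
print(revolute_waypoints(hinge, math.pi / 2, 2))
```

In the framework described above, a description of this kind would be placed in the prompt and the LLM would be asked to produce the waypoints; the closed-form rotation here only illustrates what a correct answer looks like for a single revolute joint.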

Paper Structure (14 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: As depicted in (a), traditional learning-based methods rely on vast datasets for broad manipulation tasks. Recent studies in (b) harness LLMs to reduce data reliance, but primarily address elementary tasks such as obstacle avoidance and pick-and-place. In contrast, our framework, highlighted in (c), achieves zero-shot articulated object manipulation with the kinematic-aware prompting method.
  • Figure 2: We first propose the Unified Kinematic Knowledge Parser component, which represents the object's kinematic structure as a kinematic knowledge description for LLMs, as shown in (a). Based on this description, we construct a kinematic-aware hierarchical prompt, which the Kinematic-aware Manipulation Planner component uses to guide LLMs to generate an abstract textual manipulation sequence and 3D manipulation waypoints for generalizable articulated object manipulation, as shown in (b). Distinct colors assigned to numbers represent the properties of the different kinematic structure components.
  • Figure 3: Illustration of the articulated objects used in our experiments. Each object is associated with either one or two manipulation instructions.
  • Figure 4: Real-world experiments: we generate 3D manipulation waypoints with our framework for real-world object manipulation. The joint information of cabinet and drawer is estimated by the perception model, while the others are provided manually.