LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration for Planning Sequential Contact-Rich Manipulation Tasks

Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen, Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll

TL;DR

This work addresses the limitations of vision-only guidance in LLM-based planning for contact-rich manipulation. It introduces a bootstrapped, in-context learning framework that uses tactile and force/torque information to build an object-centric skill library and translate it into a PDDL domain for grounded reasoning. Multi-modal data is used to segment demonstrations, infer skill sequences, and refine transition conditions; the resulting demonstration-derived plan then serves as a reference for planning new task configurations. Real-world experiments on cable mounting and cap tightening show improved reasoning and planning performance, with better generalization and execution reliability. Extending language guidance and fusing tactile and visual models are noted as directions for further work.

Abstract

Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.

Paper Structure (11 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Cable Manipulation with Two Robots. (a) Setup for the cable mounting demonstration. The human operator controls the robots to mount the cable onto a C-type clip and a U-type clip. Because of structural differences, the C-type clip requires a larger pushing force during insertion. (b) Multi-modal perception during cable stretching. Upper: camera observations. Lower: visual-tactile (ViTac) images from the robot fingertip. The green arrows in the ViTac images indicate force vectors, showing a linear force applied along the grasped cable during stretching.
  • Figure 2: Framework Overview. In bootstrapped reasoning, an LLM analyzer pre-processes the skill library, then sequentially reasons about skill sequences and success conditions from the multi-modal demonstration. The resulting demo task plan is used as an example for an LLM planner to plan for new tasks.
  • Figure 3: Overview of Prompts and Responses in Bootstrapped Reasoning. More prompts can be found on our project website.
  • Figure 4: Tactile Signal Patterns and Corresponding Object Statuses: (a) Sourcing pattern, referring to "grasped" status; (b) Sinking, referring to "released" status; (c) Uniform Flow, referring to "under a linear force" status; (d) Twisted Flow, referring to "under torque" status.
  • Figure 5: Skill Sequences Reasoned from Demonstrations. (a) Sequences reasoned for cable mounting by control groups A and B. The demo skill sequence in Fig. 3(c) shows the result of our pipeline. (b) and (c) Sequences reasoned for cap tightening. The resulting sequence of group B is similar to that of group A but with more move_object steps.
  • ...and 1 more figures
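
The four tactile patterns named in the Figure 4 caption can be read as properties of the force-vector field on the fingertip. The sketch below is one possible way to tell them apart and is not the method used in the paper: it assumes that "sourcing" means vectors pointing away from the contact center, "sinking" means vectors pointing toward it, "uniform flow" means a common direction, and "twisted flow" means rotation around the center. The `classify_pattern` function, the `(x, y, fx, fy)` sample layout, and the scoring rule are all hypothetical.

```python
import math
from typing import List, Tuple

# One tactile sample: marker position (x, y) and force vector (fx, fy).
Sample = Tuple[float, float, float, float]

# Pattern-to-status mapping taken from the Figure 4 caption.
STATUS = {
    "sourcing": "grasped",
    "sinking": "released",
    "uniform_flow": "under a linear force",
    "twisted_flow": "under torque",
}


def classify_pattern(samples: List[Sample]) -> str:
    """Pick the pattern whose score dominates the vector field (assumed rule)."""
    if not samples:
        raise ValueError("at least one tactile sample is required")
    n = len(samples)
    # Assumption: the contact center is the centroid of the marker positions.
    cx = sum(s[0] for s in samples) / n
    cy = sum(s[1] for s in samples) / n

    radial = 0.0       # mean outward component (divergence-like)
    tangential = 0.0   # mean counter-clockwise component (curl-like)
    mean_fx = 0.0
    mean_fy = 0.0
    for x, y, fx, fy in samples:
        mean_fx += fx / n
        mean_fy += fy / n
        rx, ry = x - cx, y - cy
        norm = math.hypot(rx, ry)
        if norm == 0.0:
            continue  # the center marker has no radial direction
        rx, ry = rx / norm, ry / norm
        radial += (fx * rx + fy * ry) / n
        tangential += (rx * fy - ry * fx) / n

    scores = {
        "sourcing": radial,
        "sinking": -radial,
        "uniform_flow": math.hypot(mean_fx, mean_fy),
        "twisted_flow": abs(tangential),
    }
    return max(scores, key=scores.get)
```

Looking up `STATUS[classify_pattern(samples)]` then yields the object status string for the dominant pattern. Real ViTac data would be noisy and would likely need thresholds or a learned classifier rather than this simple argmax.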