Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations

Hanyi Zhao, Jinxuan Zhu, Zihao Yan, Yichen Li, Yuhong Deng, Xueqian Wang

TL;DR

This work tackles the challenge of generalizable multi-step cloth manipulation by decomposing complex tasks into reusable basic skills. It leverages a large language model to autonomously discover and learn basic language-conditioned skills from long demonstrations, then uses an LLM-based task planner to compose these skills for unseen tasks. A Transformer-based skill learner grounds language-conditioned instructions into pixel-level affordance heatmaps, enabling precise pick/place targets that are translated into 3D trajectories for execution. Across simulation in SoftGym and real-world Franka experiments, the approach demonstrates superior generalization to unseen tasks and transfer from simulation to reality, providing a practical pathway toward robust, language-guided cloth manipulation.
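
The step from an affordance heatmap to a 3D target can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one heatmap for the pick point and one for the place point, takes the highest-scoring pixel of each, and back-projects that pixel with a standard pinhole camera model. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions that do not appear in the summary above.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int]
Point3D = Tuple[float, float, float]


def heatmap_argmax(heatmap: Sequence[Sequence[float]]) -> Pixel:
    """Return the (row, col) of the highest-scoring pixel in a 2D heatmap."""
    best_rc, best_val = (0, 0), float("-inf")
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_rc, best_val = (r, c), val
    return best_rc


def backproject(row: int, col: int, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole back-projection of a pixel with known depth into the camera frame."""
    x = (col - cx) * depth / fx
    y = (row - cy) * depth / fy
    return (x, y, depth)


def pick_place_points(pick_heatmap, place_heatmap, depth_image,
                      fx, fy, cx, cy) -> Tuple[Point3D, Point3D]:
    """Map two heatmaps (assumed: one for pick, one for place) to 3D endpoints."""
    points = []
    for heatmap in (pick_heatmap, place_heatmap):
        r, c = heatmap_argmax(heatmap)
        # Depth is read from the same pixel of the (assumed aligned) depth image.
        points.append(backproject(r, c, depth_image[r][c], fx, fy, cx, cy))
    return points[0], points[1]
```

In a real system the two endpoints would still have to be transformed from the camera frame into the robot base frame and connected by a motion primitive; both steps are outside this sketch.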

Abstract

Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state spaces and the dynamics of cloth. Despite recent significant advances in end-to-end imitation learning for multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our insight in tackling the challenge of generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from the existing long demonstration benchmark with the commonsense knowledge of a large language model (LLM). Then, leveraging a high-level LLM-based task planner, these basic skills can be composed to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills for both seen and unseen tasks.
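
The composition idea in the abstract can be sketched as a library of learned skills keyed by their language instruction, plus a planner that turns a task description into a sequence of those instructions. The sketch below is a rough illustration under stated assumptions, not the paper's planner: the skill names are hypothetical, `stub_planner` returns a fixed plan in place of an actual LLM call, and each skill is a placeholder function rather than a trained policy.

```python
from typing import Callable, Dict, List

SkillFn = Callable[[], str]


def make_skill(instruction: str) -> SkillFn:
    """Build a placeholder skill; a real one would run the learned policy."""
    def run() -> str:
        return "executed: " + instruction
    return run


# Hypothetical basic skills, keyed by their language instruction.
SKILLS: Dict[str, SkillFn] = {
    name: make_skill(name)
    for name in (
        "fold the left sleeve inward",
        "fold the right sleeve inward",
        "fold the bottom edge to the top edge",
    )
}


def stub_planner(task: str) -> List[str]:
    """Stand-in for the LLM-based planner: returns a fixed, hypothetical plan."""
    return [
        "fold the left sleeve inward",
        "fold the right sleeve inward",
        "fold the bottom edge to the top edge",
    ]


def execute_plan(plan: List[str], skills: Dict[str, SkillFn]) -> List[str]:
    """Run each planned step, rejecting plans that name an unknown skill."""
    unknown = [step for step in plan if step not in skills]
    if unknown:
        raise ValueError("plan uses skills that were not learned: %s" % unknown)
    return [skills[step]() for step in plan]
```

Checking the plan against the skill library before execution is one simple way to guard against a planner proposing steps that no learned skill covers.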

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Generalizable multi-step cloth manipulation. The proposed method can learn generalizable basic skills from long demonstrations and generalize to unseen multi-step cloth manipulation tasks.
  • Figure 2: Method overview. The proposed framework consists of three stages. First, we perform skill discovery from long demonstrations and establish a language-conditioned basic skill dataset. The established dataset will then be used to train the basic skills. Finally, an LLM-based task planner will be used to compose the basic skills learned for unseen multi-step manipulation tasks.
  • Figure 3: Autonomous basic skill discovery. We prompt an LLM to discover basic skills from long demonstrations.
  • Figure 4: Language-conditioned basic skill learning. We train a Transformer-based model that takes language and depth images as input and outputs a heatmap of manipulation position.
  • Figure 5: Qualitative results of real experiments. Our method performs well in multi-step manipulation tasks and can generalize to unseen tasks in the real world.
  • ...and 1 more figure