Learning Generalizable Language-Conditioned Cloth Manipulation from Long Demonstrations
Hanyi Zhao, Jinxuan Zhu, Zihao Yan, Yichen Li, Yuhong Deng, Xueqian Wang
TL;DR
This work tackles generalizable multi-step cloth manipulation by decomposing complex tasks into reusable basic skills. It uses a large language model (LLM) to autonomously discover and learn basic language-conditioned skills from long demonstrations, then uses an LLM-based task planner to compose these skills for unseen tasks. A Transformer-based skill learner grounds language instructions into pixel-level affordance heatmaps, which yield precise pick and place targets that are translated into 3D trajectories for execution. In SoftGym simulation and real-world experiments on a Franka robot, the approach outperforms baseline methods on unseen tasks and transfers from simulation to reality, offering a practical path toward robust, language-guided cloth manipulation.
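To make the heatmap-to-target step concrete, here is a minimal sketch of one common way to turn a pixel-level affordance heatmap into a 3D point. It is not the authors' implementation. It assumes the target is the highest-scoring pixel and that a depth image and pinhole camera intrinsics are available; the function names, the toy heatmaps, and the intrinsics values (`fx`, `fy`, `cx`, `cy`) are all hypothetical.

```python
import numpy as np


def heatmap_to_pixel(heatmap):
    """Return the (row, col) of the highest-scoring pixel in a 2D heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(row), int(col)


def pixel_to_camera_point(row, col, depth, fx, fy, cx, cy):
    """Back-project a pixel into camera coordinates with a pinhole model.

    depth is a 2D array of per-pixel depth in metres (an assumption of
    this sketch); fx, fy, cx, cy are the camera intrinsics.
    """
    z = float(depth[row, col])
    x = (col - cx) * z / fx
    y = (row - cy) * z / fy
    return np.array([x, y, z])


# Toy inputs standing in for the skill learner's output (hypothetical values).
pick_heatmap = np.zeros((4, 6))
pick_heatmap[1, 2] = 1.0
place_heatmap = np.zeros((4, 6))
place_heatmap[3, 5] = 1.0
depth = np.full((4, 6), 0.5)           # flat surface 0.5 m from the camera
fx, fy, cx, cy = 100.0, 100.0, 3.0, 2.0  # hypothetical intrinsics

pick_px = heatmap_to_pixel(pick_heatmap)
place_px = heatmap_to_pixel(place_heatmap)
pick_xyz = pixel_to_camera_point(*pick_px, depth, fx, fy, cx, cy)
place_xyz = pixel_to_camera_point(*place_px, depth, fx, fy, cx, cy)
```

A real system would additionally transform these camera-frame points into the robot base frame and plan a trajectory between them; both steps are omitted here.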
Abstract
Multi-step cloth manipulation is a challenging problem for robots because of the high-dimensional state space and the dynamics of cloth. Although end-to-end imitation learning has recently made significant advances in multi-step cloth manipulation, these methods fail to generalize to unseen tasks. Our key insight for tackling generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes the learned skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from an existing long-demonstration benchmark, using the commonsense knowledge of a large language model (LLM). Then a high-level LLM-based task planner composes these basic skills to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills for both seen and unseen tasks.
