Learning Language-Conditioned Deformable Object Manipulation with Graph Dynamics

Yuhong Deng; Kai Mo; Chongkun Xia; Xueqian Wang

TL;DR

This work introduces a language-conditioned framework for deformable object manipulation that fuses language, depth images, and a visible connectivity graph within a Transformer encoder-decoder to predict sequential pick-and-place actions. Language is grounded with CLIP, depth input is used to ease transfer to the real world, and the deformable object's structure is modeled by a graph whose edges come from a pre-trained edge GNN. In SoftGym experiments the method outperforms state-of-the-art baselines on multi-task performance and generalizes to unseen instructions and tasks, and real-world experiments show sim-to-real transfer with competitive success rates. The approach also emphasizes data efficiency and includes a dedicated success classifier so that the robot can decide on its own when a task is complete.

Abstract

Multi-task learning of deformable object manipulation is a challenging problem in robot manipulation. Most previous works address this problem in a goal-conditioned way and adopt goal images to specify different tasks, which limits multi-task learning performance and cannot generalize to new tasks. We therefore adopt language instructions to specify deformable object manipulation tasks and propose a learning framework. We first design a unified Transformer-based architecture to understand multi-modal data and output picking and placing actions. We also introduce a visible connectivity graph to tackle the nonlinear dynamics and complex configurations of deformable objects. Both simulated and real experiments demonstrate that the proposed method is effective and can generalize to unseen instructions and tasks. Compared with the state-of-the-art method, our method achieves higher success rates (87.2% on average) and has a 75.6% shorter inference time. We also demonstrate that our method performs well in real-world experiments.
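
To make the idea of a visible connectivity graph more concrete, the following is a minimal sketch of one common way to propose candidate edges between visible points: connect every pair of points closer than a fixed radius. This is an illustration only and is not taken from the paper, which obtains its graph with a pre-trained edge GNN. The function name `radius_edges`, the radius value, and the toy points are all hypothetical.

```python
import numpy as np


def radius_edges(points: np.ndarray, radius: float) -> np.ndarray:
    """Return undirected candidate edges (i, j) with i < j.

    Hypothetical helper: two points are connected when their Euclidean
    distance is below `radius`. A learned edge model (as in the paper)
    would then decide which candidates to keep; that step is not shown.
    """
    # Pairwise difference vectors via broadcasting: shape (N, N, 3).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep only the strict upper triangle so each pair appears once
    # and self-loops are excluded.
    close = np.triu(dists < radius, k=1)
    i, j = np.where(close)
    return np.stack([i, j], axis=1)


# Toy "visible" points (assumed units: meters); the last one is far away.
points = np.array(
    [
        [0.00, 0.00, 0.0],
        [0.05, 0.00, 0.0],
        [0.10, 0.00, 0.0],
        [0.50, 0.50, 0.0],
    ]
)
edges = radius_edges(points, radius=0.06)
print(edges)
```

In this toy case only neighboring points along the line are linked and the distant point stays isolated. The dense pairwise distance matrix keeps the sketch short but scales quadratically with the number of points, so a spatial index would be a more reasonable choice for real point clouds.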

Paper Structure (13 sections, 12 equations, 4 figures, 4 tables)

Introduction
Related Work
Learning for Deformable Object Manipulation
Language-Conditioned Robotic Manipulation Policy
Methods
Problem Formulation
Model Architecture
Implementation Details
Experiments
Simulation Experiments Setup
Simulation Experiment Results
Real World Experiments
Conclusion

Figures (4)

Figure 1: Overview. We design a unified Transformer-based model and introduce graph representation to solve language-conditioned deformable object manipulation tasks. Our model performs well on deformable object manipulation tasks.
Figure 2: Method overview. We design a unified Transformer-based model architecture to understand the multi-modal data and output picking and placing action with task completion prediction. We introduce a visible connectivity graph to tackle deformable objects' complex configurations and dynamics.
Figure 3: Some examples of language-conditioned deformable object manipulation tasks. Seen instructions, unseen instructions, and unseen tasks are marked in black, grey, and red, respectively.
Figure 4: Real-world experiments. Our model performs well in language-conditioned deformable object manipulation tasks and can generalize to unseen tasks in the real world. Unseen tasks are marked in red.
