DexKnot: Generalizable Visuomotor Policy Learning for Dexterous Bag-Knotting Manipulation

Jiayuan Zhang, Ruihai Wu, Haojun Chen, Yuran Wang, Yifan Zhong, Ceyao Zhang, Yaodong Yang, Yuanpei Chen

TL;DR

DexKnot is a framework that combines keypoint affordance with a diffusion policy to learn a generalizable bag-knotting policy. It achieves reliable and consistent knotting performance across a variety of previously unseen bag instances and deformations.

Abstract

Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle to generalize to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with a diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints are identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot actions based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the observation space to a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.
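
The matching step described in the abstract can be illustrated with a minimal sketch. It assumes per-point features are already available from some encoder and that similarity is measured by cosine similarity with a nearest-neighbor lookup; the function name `match_keypoints`, the array shapes, and the similarity metric are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def match_keypoints(point_xyz, point_feat, ref_feat):
    """For each reference keypoint, pick the observed point with the most similar feature.

    point_xyz:  (N, 3) 3D points of the observed bag.
    point_feat: (N, D) per-point features (assumed to come from a learned encoder).
    ref_feat:   (K, D) features of the K keypoints on the reference bag.
    Returns the (K, 3) matched coordinates and the (K,) matched point indices.
    """
    # Normalize so the dot product equals cosine similarity (an assumed metric).
    p = point_feat / np.linalg.norm(point_feat, axis=1, keepdims=True)
    r = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    sim = r @ p.T             # (K, N) similarity between keypoints and points
    idx = sim.argmax(axis=1)  # index of the best-matching point per keypoint
    return point_xyz[idx], idx
```

The output is a sparse set of K keypoint coordinates, which is the low-dimensional observation the abstract refers to. A real system would likely also need to handle ambiguous or occluded matches, which this sketch ignores.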

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overview of DexKnot. Top row: Our framework collects keypoint correspondence data through real-world manual deformation; these data are used to learn shape-agnostic representations. Middle row: For a novel bag configuration, the keypoints are identified via correspondence matching, which guides the policy to execute the knotting task. Bottom row: Our framework generalizes effectively to unseen deformations and bag instances.
  • Figure 2: Our Proposed Framework. Top left: For each bag, we perform manual deformation while recording RGB-D videos, and then we track the keypoints for correspondence data construction. Top right: The PointNet++ encoder learns to produce similar representations for corresponding keypoints across different deformations using an InfoNCE loss. Bottom row: During policy inference, keypoints are identified in the initial frame through representation matching and tracked across subsequent frames using TAP. These keypoint coordinates are combined with robot joint states and fed into a Diffusion Transformer to generate an action chunk.
  • Figure 3: Robot setup. Our robot platform includes a RealMan RM75-6F dual-arm system with PsiBot G0-R 6-DoF dexterous hands and a head-mounted Intel RealSense D435 RGB-D camera.
  • Figure 4: Bag deformations and instances. Top left: Deformations included in behavior demonstrations. Bottom left: Deformations not included in behavior demonstrations. Top right: Bags used for keypoint correspondence data collection. Bottom right: Bags used for behavior demonstration data collection, and novel bags for cross-instance evaluation that are not included in the keypoint correspondence data or behavior demonstrations.
  • Figure 5: Qualitative comparison of policy executions. Successes and failures are indicated by green and red bounding boxes, respectively. Top row: Both DP3 and DexKnot successfully complete the knotting task under Diagonal-Compressed (DC) deformation conditions. Middle row: In Twisted-Flat (TF) conditions, DP3 fails to thread the handle while DexKnot successfully accomplishes the task. Bottom row: In Inclined-Flat (IF) conditions, DP3 fails to thread the handle while DexKnot successfully accomplishes the task.
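
The Figure 2 caption mentions an InfoNCE loss that pulls together the representations of corresponding keypoints across deformations. A minimal NumPy sketch of a generic InfoNCE objective is given below. It assumes row i of both feature arrays describes the same keypoint under two different deformations, uses cosine similarity, and picks an arbitrary temperature of 0.1; these choices and the name `info_nce` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def info_nce(feat_a, feat_b, temperature=0.1):
    """Generic InfoNCE loss over K corresponding keypoint features.

    feat_a, feat_b: (K, D) features of the same K keypoints under two
    deformations; row i of feat_a is the positive for row i of feat_b,
    and all other rows act as negatives.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                     # (K, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (true correspondences) as targets.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is low when each keypoint's feature is closest to its counterpart in the other deformation and high when correspondences are scrambled, which is the property that makes nearest-neighbor matching at inference time meaningful.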