ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing

Yongqiang Zhao, Haining Luo, Yupeng Wang, Emmanouil Spyrakos Papastavridis, Yiannis Demiris, Shan Luo

Abstract

Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bring them into extended states and facilitates downstream manipulation tasks. Because they require object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects.

Paper Structure (23 sections, 10 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: (a) 1D deformable object tracing. (b) 2D deformable object tracing. Using our proposed method, an ABB YuMi traces the objects by sliding a gripper along the 1D deformable object or an edge of the 2D deformable object, transforming them from the unstructured configurations on the left to the extended states on the right.
  • Figure 2: Visual-tactile teleoperation system for collecting demonstrations. On the robot side, a top-view camera is installed, and a tactile sensor is mounted on the follower robot’s gripper. On the teleoperator side, visual and tactile images are monitored in real time, and vibration motors are mounted on the leader robot’s gripper.
  • Figure 3: Overview of the proposed tracing policy learning framework. The inputs include robot kinematics $o^K_t$, visual image $o^V_t$, and tactile image $o^T_t$ collected from the follower robot, while the ground truth consists of the action sequence $a_{t:t+k}$ recorded from the leader robot. Input features are first extracted using an MLP and a CNN. These features are then concatenated and fed into a Transformer-based policy network, which is trained using a combination of three loss functions: local center loss, global task loss, and regularization loss.
  • Figure 4: Tactile images under different contact locations between the object and gripper. (a) Object grasped near the center of the tactile sensing region; (b) Object grasped near the front edge of the tactile sensing region; (c) Object grasped near the rear edge of the tactile sensing region. When the object is grasped near the edges of the sensing region, it is more likely to slip into an imperceptible area.
  • Figure 5: 1D and 2D deformable objects used in the experiments. (a) Seen objects; (b) Unseen objects. For each seen object, 25 demonstrations were collected. The trained models were tested on both seen and unseen objects.
  • ...and 2 more figures
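
The captions for Figures 3 and 4 describe a local center loss that favors contact near the middle of the tactile sensing region, but the exact formulation is not given in this summary. The following is only an illustrative sketch of one way such a weighting could work, under stated assumptions: the contact centroid is already extracted from the tactile image, the weight decays linearly with normalized distance from the image center, and the base imitation error is an L1 error over the action sequence. The names `center_weight`, `weighted_l1`, and `alpha` are hypothetical and do not come from the paper.

```python
import math


def center_weight(contact_xy, image_size, alpha=1.0):
    """Hypothetical weight in [1, 1 + alpha].

    Largest when the contact centroid sits at the tactile image center,
    smallest when it sits at a corner. This linear form is an assumption,
    not the paper's formula.
    """
    (x, y), (w, h) = contact_xy, image_size
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    # Normalise by the half-diagonal so the distance d lies in [0, 1].
    d = math.hypot(x - cx, y - cy) / math.hypot(cx, cy)
    return 1.0 + alpha * (1.0 - min(d, 1.0))


def weighted_l1(pred, target, weight):
    """Mean absolute error over an action sequence, scaled by `weight`.

    `pred` and `target` are lists of per-step action vectors
    (standing in for a_{t:t+k}).
    """
    errs = [abs(p - t) for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return weight * sum(errs) / len(errs)


# Example: a demonstration step whose contact is centered gets more weight
# than one whose contact is at the edge of a 64x64 tactile image.
w_center = center_weight((31.5, 31.5), (64, 64))
w_corner = center_weight((0, 0), (64, 64))
loss = weighted_l1([[0.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 3.0]], w_center)
```

In this sketch, demonstration steps with centered contact contribute more to the imitation objective, which matches the motivation in Figure 4 that objects held near the sensor edges are more likely to slip out of the perceptible area. The global task loss and regularization loss named in Figure 3 are not modeled here.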