UniT: Data Efficient Tactile Representation with Generalization to Unseen Objects

Zhengtong Xu; Raghava Uppuluri; Xinwei Zhang; Cael Fitch; Philip Glen Crandall; Wan Shou; Dongyi Wang; Yu She

TL;DR

Through extensive experimentation, UniT is shown to be a simple-to-train, plug-and-play, yet widely effective method for tactile representation learning.

Abstract

UniT is an approach to tactile representation learning that uses a VQGAN to learn a compact latent space, which then serves as the tactile representation. The representation is trained on tactile images obtained from a single simple object, yet it generalizes to unseen objects. It can be transferred zero-shot to various downstream tasks, including perception tasks and manipulation policy learning. Our benchmarks on in-hand 3D pose estimation, 6D pose estimation, and tactile classification show that UniT outperforms existing visual and tactile representation learning methods. Additionally, UniT's effectiveness in policy learning is demonstrated across three real-world tasks involving diverse manipulated objects and complex robot-object-environment interactions. Through extensive experimentation, UniT is shown to be a simple-to-train, plug-and-play, yet widely effective method for tactile representation learning. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/UniT and the project website https://zhengtongxu.github.io/unit-website/.
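As a rough illustration of the vector-quantization step that gives VQGAN-style autoencoders their discrete latent space, the sketch below maps each latent vector to its nearest codebook entry. It is a minimal NumPy example under stated assumptions, not the UniT implementation: the function name `quantize`, the toy codebook, and the toy latents are all hypothetical, and this page does not say whether UniT takes its representation before or after quantization.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared L2 distance).

    latents:  (N, D) array, one row per spatial location of an encoder output.
    codebook: (K, D) array of code vectors (learned in a real VQGAN; fixed here).
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Pairwise differences between every latent and every code: shape (N, K, D).
    diffs = latents[:, None, :] - codebook[None, :, :]
    # Squared distances, shape (N, K).
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each latent.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Hypothetical toy data: three 2-D codes and three 2-D latents.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])

indices, quantized = quantize(latents, codebook)
print(indices)    # [0 1 2]
print(quantized)  # the matching rows of the codebook
```

In a full VQGAN this lookup sits between a convolutional encoder and decoder and is trained with reconstruction, perceptual, and adversarial losses; none of that is shown here.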

Paper Structure (15 sections, 7 figures, 6 tables)

Introduction
Related Work
Tactile-involved Imitation Learning
Representation Learning in Imitation Learning and Tactile Sensing
Background: VQGAN autoencoder
Method
Training Pipeline
Train with Simple and Single Object
Decoder Head for Downstream Tasks
Reconstruction Experiments
Tactile Perception Experiments
Pose Estimation
Classification
Supervised Policy Learning Experiments
Future Work

Figures (7)

Figure 1: Pipeline of UniT representation training.
Figure 2: Decoder architecture for applying the UniT representation to downstream tasks.
Figure 3: Example results of UniT reconstruction of diverse unseen objects. Rec. denotes the reconstruction and Ground. denotes the ground truth. Sensors 1, 2, and 3 are three different GelSight Mini sensors. Each training dataset for the autoencoder is collected on only one sensor.
Figure 5: Example results of marker tracking. To evaluate whether the learned representations contain information about the dynamic marker motion, we run marker tracking [yuan2017gelsight] on ground truth images and on the corresponding image reconstructions by MAE and UniT.
Figure 6: Visualization of latent spaces. We transformed tactile images of size 3×128×160 into latent spaces of size 3×16×20 and visualized these latent spaces as RGB images. Both the VQGAN and CNN autoencoders are trained solely on the Allen key dataset. We tested three different unseen objects. The three images on the left are from an artificial strawberry, the three in the middle are from a nut, and the three on the right are from a screw.
...and 2 more figures
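
The image and latent sizes quoted in the Figure 6 caption imply a fixed spatial downsampling factor and compression ratio. The short calculation below works these out; the variable names are illustrative, and only the two shapes come from the caption.

```python
import math

# Shapes quoted in the Figure 6 caption: (channels, height, width).
image_shape = (3, 128, 160)
latent_shape = (3, 16, 20)

# Spatial downsampling factor along height and width.
down_h = image_shape[1] // latent_shape[1]
down_w = image_shape[2] // latent_shape[2]

# Total number of values in the image and in the latent.
n_image = math.prod(image_shape)
n_latent = math.prod(latent_shape)
ratio = n_image / n_latent

print(down_h, down_w)     # 8 8
print(n_image, n_latent)  # 61440 960
print(ratio)              # 64.0
```

So each latent location summarizes an 8×8 patch of the tactile image, and the latent holds 64 times fewer values than the input.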
