Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features

Justin Kerr; Huang Huang; Albert Wilcox; Ryan Hoque; Jeffrey Ichnowski; Roberto Calandra; Ken Goldberg

TL;DR

The paper addresses the reliance of deformable garment manipulation on human-labeled data by proposing Self-Supervised Visuo-Tactile Pretraining (SSVTP). A robot with a custom end-effector collects spatially aligned visual and tactile image pairs, and visual and tactile encoders are trained to embed both modalities into a shared latent space via cross-modal contrastive learning, while rotation is decoupled through a separate rotation prediction network. The pretrained representations are deployed without fine-tuning on five downstream tasks (three active sliding perception tasks and two passive perception tasks) and achieve success rates of 73-100%. The work demonstrates cross-modal localization and feature tracking on garments, highlights the potential of task-agnostic visuo-tactile representations, and provides a real-world dataset of 4500 aligned pairs.

Abstract

Humans make extensive use of vision and touch as complementary senses, with vision providing global information about the scene and touch measuring local information during manipulation without suffering from occlusions. While prior work demonstrates the efficacy of tactile sensing for precise manipulation of deformables, it typically relies on supervised, human-labeled datasets. We propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for learning multi-task visuo-tactile representations in a self-supervised manner through cross-modal supervision. We design a mechanism that enables a robot to autonomously collect precisely spatially-aligned visual and tactile image pairs, then train visual and tactile encoders to embed these pairs into a shared latent space using a cross-modal contrastive loss. We apply this latent space to downstream perception and control of deformable garments on flat surfaces, and evaluate the flexibility of the learned representations without fine-tuning on 5 tasks: feature classification, contact localization, anomaly detection, feature search from a visual query (e.g., garment feature localization under occlusion), and edge following along cloth edges. The pretrained representations achieve a 73-100% success rate on these 5 tasks.
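To make the idea of a cross-modal contrastive loss concrete, the following is a minimal sketch in NumPy. The abstract only states that paired visual and tactile images are embedded into a shared latent space with a cross-modal contrastive loss; the symmetric InfoNCE form, the `temperature` value, and all function names here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_contrastive_loss(z_visual, z_tactile, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of z_visual and row i of z_tactile are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    zv = l2_normalize(z_visual)
    zt = l2_normalize(z_tactile)
    logits = zv @ zt.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(zv))
    loss_v2t = -log_softmax(logits)[idx, idx].mean()    # visual -> tactile
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()  # tactile -> visual
    return 0.5 * (loss_v2t + loss_t2v)
```

In this sketch the loss is low when each visual embedding is closest to its own tactile partner and high when the pairing is scrambled, which is the property a shared latent space relies on.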

Paper Structure (31 sections, 7 figures, 2 tables)

Introduction
Related Work
Tactile Sensing for Robotics
Visuo-Tactile Cross-Modal Learning
Problem Statement
Methods
Self-Supervised Data Collection
Latent Space Training
Rotation Prediction Network
Sliding Perception Primitives
Anomaly Detection
Feature Search
Edge Following
Passive Perception Modules
Contact Localizer
...and 16 more sections

Figures (7)

Figure 1: SSVTP Overview. (a) We design a self-supervised framework described in Section \ref{ssec:data_col} to collect 4500 spatially aligned visual and tactile images, and use this dataset to learn a shared visuo-tactile latent space $\mathcal{Z}$ as described in Section \ref{ssec:encoders}. We apply this latent space without fine-tuning for 3 active sliding perception tasks: Anomaly Detection (b), Feature Search (c), and Edge Following (d).
Figure 2: Custom Hardware Design: (a) A CAD model of the mechanical mount designed for data collection with the RGB camera and tactile sensor. (b) The actual sensors and the mount.
Figure 3: Self-Supervised Visuo-Tactile Data Collection: An overview of the data collection pipeline. (a) For each sample the robot first uses an RGB camera to take a closely cropped image of the texture. (b) The robot then adjusts its end effector to take a tactile reading at the same location. (c) We collect data on 10 deformable surface environments, one of which is shown here.
Figure 4: Active Sliding Perception Tasks. (a) Anomaly Detection: The robot uses the learned tactile encoder to detect a knot while sliding over a black thread on a black surface. (b) Feature Search: Here the system is given a query visual image of a zipper and searches the workspace for a tactile reading that is sufficiently similar to the query (i.e., the texture of a zipper). (c) Edge Following: The robot uses rotation predictions to follow a curved cable. Visual inputs are colorized for clarity; the visual network takes grayscale input.
Figure 5: Tactile Localization: An input tactile image of the zipper is compared to discretized patches from a visual image of the entire scene. In this example, the heatmap shows a high probability of a match near the zipper. Note that visual images are colorized for clarity; the visual encoder takes grayscale input.
...and 2 more figures
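The tactile localization described in the Figure 5 caption (one tactile embedding compared against embeddings of discretized visual patches, producing a heatmap) can be sketched as follows. This is an illustrative sketch only: the caption does not specify how similarities become probabilities, so the cosine similarity, the softmax over patches, the `temperature` value, and the function name `localization_heatmap` are all assumptions.

```python
import numpy as np


def localization_heatmap(z_tactile, z_patches, temperature=0.1):
    """Match probability of one tactile embedding over a grid of visual patches.

    z_tactile: (D,) embedding of the tactile reading.
    z_patches: (H, W, D) embeddings of the visual patches, one per grid cell.
    Returns an (H, W) array that sums to 1 (softmax over cosine similarities,
    an assumed choice rather than the paper's stated method).
    """
    h, w, d = z_patches.shape
    flat = z_patches.reshape(-1, d)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    query = z_tactile / (np.linalg.norm(z_tactile) + 1e-12)
    sims = flat @ query / temperature         # scaled cosine similarity per patch
    probs = np.exp(sims - sims.max())         # subtract max for numerical stability
    probs /= probs.sum()
    return probs.reshape(h, w)
```

The grid cell with the highest value is the most likely contact location under these assumptions; in practice the patch embeddings would come from the visual encoder and the query from the tactile encoder.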
