Contrastive Touch-to-Touch Pretraining

Samanta Rodriguez, Yiming Dou, William van den Bogert, Miquel Oller, Kevin So, Andrew Owens, Nima Fazeli

TL;DR

This paper leverages contrastive learning to integrate tactile signals from two different sensors into a shared embedding space, using a dataset in which the same objects are probed with multiple sensors.

Abstract

Today's tactile sensors have a variety of different designs, making it challenging to develop general-purpose methods for processing touch signals. In this paper, we learn a unified representation that captures the shared information between different tactile sensors. Unlike current approaches that focus on reconstruction or task-specific supervision, we leverage contrastive learning to integrate tactile signals from two different sensors into a shared embedding space, using a dataset in which the same objects are probed with multiple sensors. We apply this approach to paired touch signals from GelSlim and Soft Bubble sensors. We show that our learned features provide strong pretraining for downstream pose estimation and classification tasks. We also show that our embedding enables models trained using one touch sensor to be deployed using another without additional training. Project details can be found at https://www.mmintlab.com/research/cttp/.
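The contrastive objective described above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss, a common choice for paired contrastive learning; this section does not give the paper's exact loss, temperature, or encoder architecture. The function names, the `temperature` value, and the array shapes are hypothetical. Row `i` of each input is the embedding of the same contact as seen by one of the two sensors (for example, GelSlim and Soft Bubble).

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    z_a, z_b: arrays of shape (batch, dim). Row i of z_a and row i of z_b
    are assumed to come from the same contact, measured by two sensors.
    temperature: hypothetical value, not taken from the paper.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # logits[i, j] compares sample i from sensor A with sample j from sensor B.
    logits = z_a @ z_b.T / temperature
    idx = np.arange(z_a.shape[0])

    # Matching pairs sit on the diagonal; every other entry is a negative.
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)
```

Because every other sample in the batch acts as a negative, the batch size determines how many negatives each pair is contrasted against, which is one reason a batch-size comparison is a natural ablation for this kind of objective.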

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Contrastive Touch-to-Touch Pretraining (CTTP). We learn a joint embedding between signals from different tactile sensors. The resulting touch feature representation conveys the physical properties of the touched object that both sensors capture, which makes it useful pretraining for downstream tasks. The embedding also enables "zero-shot" transfer of downstream touch models from one sensor to another.
  • Figure 2: Representation Learning Models Comparison on Classification Accuracy. We compare CTTP to our baselines on the downstream task of tool classification. We evaluate their performance in three areas: generalization within a single visuo-tactile sensor, generalization across different visuo-tactile sensors, and generalization to unseen tools. For reference, the dotted line represents random chance performance.
  • Figure 3: CTTP Batch Size Comparison on Classification Accuracy. We compare CTTP models trained with different batch sizes on the downstream task of tool classification. We evaluate their performance in three areas: generalization within a single visuo-tactile sensor, generalization across different visuo-tactile sensors, and generalization to unseen tools. For reference, the dotted line represents random chance performance.
  • Figure 4: t-SNE Comparison. We present the results of a t-SNE analysis on embeddings from both seen and unseen tools in our dataset. We conduct this analysis for CTTP, several baseline models (top), and CTTP trained with different batch sizes (bottom). The visualization shows the structure of and relationships between the embeddings, with an emphasis on tool differentiation (gray and pink) and sensor alignment (maize and blue). We consider a model successful when the t-SNE plot shows groupings by tool and, at the same time, the sensor colors overlap (the sensors are aligned). Our CTTP model is trained with a batch size of 128 (bottom).
  • Figure 5: Insertion Task. We use our representation to perform peg insertion tasks using both tool classification and in-hand pose estimation. a) Our testing setup, using each of the three unseen tools and their corresponding holes. After the tool is handed to the robot, it is classified, which allows the robot to select the correct hole. b) The CTTP-generated latent space performs far better in cross-sensor task transfer.