Touch2Touch: Cross-Modal Tactile Generation for Object Manipulation

Samanta Rodriguez, Yiming Dou, Miquel Oller, Andrew Owens, Nima Fazeli

TL;DR

General-purpose touch processing across different sensor designs is addressed by cross-modal prediction between touch sensors: given the tactile signal from one sensor, a generative model estimates how the same physical contact would be perceived by another sensor.

Abstract

Today's touch sensors come in many shapes and sizes. This has made it challenging to develop general-purpose touch processing methods since models are generally tied to one specific sensor design. We address this problem by performing cross-modal prediction between touch sensors: given the tactile signal from one sensor, we use a generative model to estimate how the same physical contact would be perceived by another sensor. This allows us to apply sensor-specific methods to the generated signal. We implement this idea by training a diffusion model to translate between the popular GelSlim and Soft Bubble sensors. As a downstream task, we perform in-hand object pose estimation using GelSlim sensors while using an algorithm that operates only on Soft Bubble signals. The dataset, the code, and additional details can be found at https://www.mmintlab.com/research/touch2touch/.

Paper Structure (9 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Transferring manipulation methods between touch sensors using cross-modal prediction. We execute an object manipulation skill designed for one touch sensor (Soft Bubble) on a robot equipped with another sensor (GelSlim). To do this, we use a cross-modal diffusion model to translate one touch signal to another; that is, we predict what the object would have felt like if it were manipulated with Soft Bubble rather than GelSlim. The robot then uses this prediction to perform its action.
  • Figure 2: Collecting a dataset of paired touch signals. To obtain paired touch data, we have a robot probe an object at the same position using two different touch sensors.
  • Figure 3: GelSlim and Soft Bubble touch signals. We show 2 of the 12 tools from our dataset and the corresponding GelSlim and Soft Bubble images. The dashed rectangle over the Soft Bubble image indicates the (much smaller) contact area covered by GelSlim.
  • Figure 4: Downstream object manipulation tasks. A robot arm is equipped with the GelSlim sensor. It uses our model to estimate the corresponding Soft Bubble signal. With this signal, it successfully completes stacking and insertion tasks using an algorithm that operates only on Soft Bubble signals.
  • Figure 5: Cross-modal tactile generation. We predict a Soft Bubble signal from a GelSlim signal using a diffusion model. To provide a conditioning signal, GelSlim images are encoded into 2D feature maps using a ResNet and concatenated channel-wise with the latent code.
  • ...and 2 more figures
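
The conditioning step described in the Figure 5 caption (feature maps concatenated channel-wise with the latent code) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the tensor shapes, and the random arrays standing in for the ResNet feature maps and the latent code are all assumptions chosen for illustration.

```python
import numpy as np


def concat_conditioning(latent, cond_features):
    """Concatenate conditioning feature maps with a latent along the channel axis.

    latent:        array of shape (batch, c_latent, height, width)
    cond_features: array of shape (batch, c_cond, height, width)
    Returns an array of shape (batch, c_latent + c_cond, height, width).
    """
    if latent.shape[0] != cond_features.shape[0]:
        raise ValueError("batch sizes differ")
    if latent.shape[2:] != cond_features.shape[2:]:
        raise ValueError("spatial sizes differ; resize the feature maps first")
    # axis=1 is the channel axis in the (batch, channel, height, width) layout.
    return np.concatenate([latent, cond_features], axis=1)


rng = np.random.default_rng(0)
# Hypothetical shapes: a 4-channel latent and 8-channel feature maps.
latent = rng.standard_normal((2, 4, 16, 16))    # stand-in for the latent code
features = rng.standard_normal((2, 8, 16, 16))  # stand-in for ResNet feature maps
denoiser_input = concat_conditioning(latent, features)
print(denoiser_input.shape)  # (2, 12, 16, 16)
```

Channel-wise concatenation requires the feature maps and the latent to share the same spatial size, which is why the sketch checks the shapes before joining them.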