How to Train your Tactile Model: Tactile Perception with Multi-fingered Robot Hands

Christopher J. Ford, Kaichen Shi, Laura Butcher, Nathan F. Lepora, Efi Psomopoulou

Abstract

Rapid deployment of new tactile sensors is essential for scalable robotic manipulation, especially in multi-fingered hands equipped with vision-based tactile sensors. However, current methods for inferring contact properties rely heavily on convolutional neural networks (CNNs), which, while effective on known sensors, require large, sensor-specific datasets. Furthermore, they require retraining for each new sensor due to differences in lens properties, illumination, and sensor wear. Here we introduce TacViT, a novel tactile perception model based on Vision Transformers, designed to generalize to data from new sensors. TacViT leverages global self-attention mechanisms to extract robust features from tactile images, enabling accurate contact property inference even on previously unseen sensors. This capability significantly reduces the need for data collection and retraining, accelerating the deployment of new sensors. We evaluate TacViT on sensors from a five-fingered robot hand and demonstrate its superior generalization performance compared to CNNs. Our results highlight TacViT's potential to make tactile sensing more scalable and practical for real-world robotic applications.

Paper Structure

This paper contains 19 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 3: Sensors from a five-fingered robot hand equipped with vision-based tactile sensors (VBTS) are used to evaluate TacViT and CNN models. Three experiments assess model performance: training and testing on the same sensor (Tr1–Te1), training on all sensors and testing on one seen during training (Tr5–Te1), and training on four sensors and testing on an unseen sensor (Tr4–TeU); a sketch of these splits follows this list. These experiments show that TacViT generalizes to new sensors without retraining.
  • Figure 4: Examples of tactile images from different sensors, showing the evident variation across the TacTip-style sensors.
  • Figure 5: TacViT pipeline: a tactile image is divided into patches, which are linearized and fed through a transformer encoder. The transformer extracts feature embeddings from the image, which are then passed through a regression head consisting of several fully connected (FC) layers to output pose predictions (a minimal PyTorch-style sketch of this pipeline also follows this list).
  • Figure 6: Strip plots showing the distribution of mean absolute error (MAE) for TacViT and CNN on the three evaluation experiments. Each diamond represents one model trained for a given method and experiment (covering all distinct training combinations).
  • Figure 7: Example scatter plots of predicted vs. ground-truth values for each pose and force parameter in each experiment, using TacViT and CNN. Significant performance degradation can be seen in the CNN, whereas TacViT adapts better to data from an unseen domain.
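The three evaluation protocols described in the Figure 3 caption amount to different choices of training and test sensors. Below is a minimal Python sketch of how such splits might be constructed; the sensor names and the returned layout are illustrative assumptions, not the paper's actual data pipeline.

```python
# Illustrative sketch of the three cross-sensor evaluation protocols
# (Tr1-Te1, Tr5-Te1, Tr4-TeU). Sensor names and the split layout are
# assumptions for illustration only, not the paper's dataset structure.
SENSORS = ["thumb", "index", "middle", "ring", "little"]


def make_splits(test_sensor):
    """Return (train_sensors, test_sensor) pairs for the three experiments."""
    others = [s for s in SENSORS if s != test_sensor]
    return {
        # Train and test on the same single sensor.
        "Tr1-Te1": ([test_sensor], test_sensor),
        # Train on all five sensors, test on one seen during training.
        "Tr5-Te1": (SENSORS, test_sensor),
        # Train on the other four sensors, test on the held-out (unseen) one.
        "Tr4-TeU": (others, test_sensor),
    }


print(make_splits("index"))
```

Repeating the Tr4–TeU split with each sensor held out in turn gives the "all distinct training combinations" referred to in the Figure 6 caption.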
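The Figure 5 pipeline (patch embedding, a transformer encoder with global self-attention, and a fully connected regression head) can be sketched in PyTorch as follows. All hyperparameters (image size, patch size, embedding dimension, number of outputs) and the mean-pooling readout are assumptions made for illustration; the paper's actual TacViT configuration may differ.

```python
# Minimal sketch of a ViT-style tactile regression model in PyTorch.
# Hyperparameters and the output dimensionality (e.g. 6 pose/force targets)
# are illustrative assumptions, not the reported TacViT configuration.
import torch
import torch.nn as nn


class TactileViTRegressor(nn.Module):
    def __init__(self, image_size=128, patch_size=16, in_channels=1,
                 embed_dim=256, depth=6, num_heads=8, num_outputs=6):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: split the tactile image into patches and linearly
        # project each one (implemented as a strided convolution).
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Transformer encoder applying global self-attention over all patches.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regression head: fully connected layers mapping pooled features
        # to pose/force predictions.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_outputs))

    def forward(self, x):
        x = self.patch_embed(x)               # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, D)
        x = self.encoder(x + self.pos_embed)  # global self-attention
        x = x.mean(dim=1)                     # pool patch embeddings
        return self.head(x)                   # pose/force regression


# Example: one grayscale 128x128 tactile image -> 6 regression targets.
model = TactileViTRegressor()
pred = model(torch.randn(1, 1, 128, 128))
print(pred.shape)  # torch.Size([1, 6])
```

Mean pooling over patch embeddings is used here for simplicity; a learned class token or another readout could equally serve as the input to the regression head.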