Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks

Jialiang Zhao, Yuxiang Ma, Lirui Wang, Edward H. Adelson

TL;DR

To address the extreme heterogeneity of camera-based tactile sensors, the paper introduces FoTa, a large and diverse tactile dataset, and T3, a transferable tactile transformer framework. FoTa aggregates over 3 million tactile images from 13 sensors and 11 tasks into a unified WebDataset format to support large-scale pretraining on unaligned data. T3 combines sensor-specific encoders, a shared trunk, and task-specific decoders. It is pretrained in two stages (masked autoencoding, then supervised training) and can optionally be fine-tuned per task. The pretrained model transfers zero-shot in some sensor-task pairings, improves with small amounts of domain-specific data in others, and scales with network size. Used as a tactile encoder, T3 also improves performance on long-horizon manipulation tasks such as sub-millimeter electronics insertion. The datasets, code, and model checkpoints are openly released.

Abstract

This paper presents T3: Transferable Tactile Transformers, a framework for tactile representation learning that scales across multiple sensors and tasks. T3 is designed to address the extreme heterogeneity of camera-based tactile sensing: sensors are built into different form factors, and existing datasets were collected for disparate tasks. T3 captures the latent information shared across sensor-task pairings by placing a shared trunk transformer between sensor-specific encoders and task-specific decoders. T3 is pre-trained on a novel Foundation Tactile (FoTa) dataset, which is aggregated from several open-sourced datasets and contains over 3 million data points gathered from 13 sensors and 11 tasks. FoTa is the largest and most diverse dataset in tactile sensing to date and is made publicly available in a unified format. Across various sensors and tasks, experiments show that T3 pre-trained with FoTa achieves zero-shot transferability in certain sensor-task pairings, can be further fine-tuned with small amounts of domain-specific data, and improves as the network size grows. T3 is also effective as a tactile encoder for long-horizon, contact-rich manipulation. Results from sub-millimeter multi-pin electronics insertion tasks show that T3 achieved a task success rate 25% higher than policies whose tactile encoders were trained from scratch, and 53% higher than policies without tactile sensing. Data, code, and model checkpoints are open-sourced at https://t3.alanz.info
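
The encoder-trunk-decoder routing described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the class name `SharedTrunkModel`, the sensor keys `sensor_a` and `sensor_b`, and the toy functions standing in for transformer blocks and decoder heads are all hypothetical. The sketch only shows how a (sensor, task) pair selects its own encoder and decoder while every pair passes through the same trunk.

```python
from typing import Callable, Dict, List

Vector = List[float]
Module = Callable[[Vector], Vector]


class SharedTrunkModel:
    """Toy routing: sensor-specific encoder -> shared trunk -> task-specific decoder."""

    def __init__(
        self,
        encoders: Dict[str, Module],
        trunk: Module,
        decoders: Dict[str, Module],
    ) -> None:
        self.encoders = encoders  # one entry per sensor family
        self.trunk = trunk        # a single module shared by all pairings
        self.decoders = decoders  # one entry per task

    def forward(self, sensor: str, task: str, image: Vector) -> Vector:
        tokens = self.encoders[sensor](image)  # chosen by the sensor
        latent = self.trunk(tokens)            # identical for every pairing
        return self.decoders[task](latent)     # chosen by the task


def scale_by(factor: float) -> Module:
    """Placeholder encoder: rescales raw readings (stands in for transformer blocks)."""
    return lambda x: [v * factor for v in x]


def center(x: Vector) -> Vector:
    """Placeholder trunk: subtracts the mean of its input."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def argmax_head(x: Vector) -> Vector:
    """Placeholder classification decoder: index of the largest entry."""
    return [float(max(range(len(x)), key=x.__getitem__))]


def mean_head(x: Vector) -> Vector:
    """Placeholder regression decoder: mean of the latent vector."""
    return [sum(x) / len(x)]


model = SharedTrunkModel(
    encoders={"sensor_a": scale_by(1 / 255), "sensor_b": scale_by(1.0)},
    trunk=center,
    decoders={"classification": argmax_head, "pose_estimation": mean_head},
)

label = model.forward("sensor_a", "classification", [0.0, 255.0, 51.0])
pose = model.forward("sensor_b", "pose_estimation", [1.0, 2.0, 3.0])
```

Under this layout, supporting a new sensor or task means adding one dictionary entry while the trunk stays unchanged, which is the property that lets the shared representation be reused across pairings.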

Paper Structure (33 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Architecture illustration of Transferable Tactile Transformers (T3). T3 learns a shared representation across heterogeneous tactile sensors and downstream tasks by placing a shared trunk between sensor-specific encoders and task-specific decoders. The encoders and the shared trunk are built from transformer blocks. The decoder architecture is chosen according to the type of task: a transformer for generative tasks such as reconstruction in masked autoencoding, an MLP for classification tasks, and a CNN + MLP for pose estimation.
  • Figure 2: FoTa dataset visualizations. (a) We show the mixture and distribution of constituent datasets, sensors, and tasks of the FoTa dataset. Note that not all tasks are utilized in the training of T3. (b) We visualize one tactile image from each constituent sensor in FoTa. Note that in the training of T3, similar sensors share encoders. For example, GelSight17 var. {1-4} share the same encoder, and GelSight var. {1-2} share the same encoder.
  • Figure 3: Experiments on Task Performance and Transferability of T3. (a) Evaluation performance with 4 network sizes (tiny, small, medium, large), 3 training schemes (train from scratch, fine-tune from pre-train 1, fine-tune from pre-train 2), and 2 amounts of fine-tuning data (half data and full data). (b) Evaluation performance with different masking ratios during Pre-training I. (c) Transferability test on a classification task. (d) Transferability test on a pose estimation task.
  • Figure 4: Real-world sub-millimeter robotic insertion tasks. (a) Hardware setup. (b) The 3 evaluation electronics parts and their respective mounting places on a PCB. (c) Insertion success rate and average steps taken for each task. Policies trained with the T3 tactile encoder achieved the highest success rate and the lowest average episode length.
  • Figure 5: Visualization of self-attention weights of the encoders and the trunk of pre-trained T3.
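
Figure 3(b) varies the masking ratio used during Pre-training I. As a rough illustration of what that ratio controls, the sketch below performs MAE-style random patch masking on patch indices using only the Python standard library. The function name `random_patch_mask`, the 196-patch grid, and the 0.75 ratio are assumptions chosen for the example, not values taken from the paper.

```python
import random
from typing import List, Optional, Tuple


def random_patch_mask(
    num_patches: int, mask_ratio: float, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) sets at the given ratio."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])   # hidden patches, to be reconstructed
    visible = sorted(order[num_masked:])  # patches the encoder actually sees
    return visible, masked


# Hypothetical setting: a 14 x 14 patch grid with 75% of patches hidden.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
```

A higher ratio leaves fewer visible patches for the encoder and more for the decoder to reconstruct, which is the trade-off a masking-ratio sweep explores.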