Sensor-Invariant Tactile Representation

Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan

TL;DR

This work addresses the transfer of tactile perception models across diverse vision-based sensors by learning Sensor-Invariant Tactile Representations (SITR). It combines calibration images, a transformer-based encoder, normal-map reconstruction, and supervised contrastive learning, trained on a large synthetic dataset covering 100 sensor configurations, and evaluates zero-shot transfer on real GelSight-based sensors for shape reconstruction, object classification, and pose estimation. SITR consistently outperforms strong baselines in inter-sensor transfer, preserving contact geometry and generalizing across sensors. The approach supports scalable data and model transfer in tactile sensing and may extend to other sensor designs.

Abstract

High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method's effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.

Paper Structure

This paper contains 34 sections, 2 equations, 29 figures, 7 tables.

Figures (29)

  • Figure 1: Vision-based tactile sensors vary in both optical design and physical properties. Even with the same contact object, a screw, the tactile images produced by each sensor differ significantly. These variations highlight the challenge of transferring models from one sensor to another.
  • Figure 2: Our sensor-invariant representation learning framework. Each tactile image $x$ is paired with a set of calibration images $c$. We patchify and linearly project $x$ and $c$ to tokens. Additionally, the $c$ patches are region-wise stacked before projection. We concatenate the input tokens with a class token $z$ and pass it through a transformer encoder. The class token $z$ is trained with SCL, while patch tokens are supervised by normal map reconstruction loss. We highlight in grey the concatenation of the output class token and patch tokens as our Sensor-Invariant Tactile Representation (SITR) for downstream tasks.
  • Figure 3: Calibration images used in SITR, obtained by pressing two objects (a 4 mm ball and a cube corner) at nine different locations each in a $3\times 3$ grid.
  • Figure 4: Demonstration of our physics-based rendering (PBR) model to simulate GelSight sensors. We parameterize the sensor's optical design in the environment.
  • Figure 5: Reconstruction examples for various sensors. The top row shows input tactile images, the middle row presents 3D reconstructions, and the bottom row shows the contact objects. Simulated sensors (Simulation 1 and 2) are in the training set, while real sensors (GelSight Mini, DIGIT, Hex, Wedge) are not.
  • ...and 24 more figures
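
The token construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes that "region-wise stacked" means concatenating, for each patch location, the corresponding patches from all calibration images before a single linear projection. The function names (`patchify`, `build_tokens`), the token ordering, and the image size, patch size, and embedding width used in the usage lines are hypothetical. The count of 18 calibration images follows from Figure 3 (two objects, nine locations each).

```python
import numpy as np


def patchify(img, p):
    """Split an (H, W, C) image into N = (H/p)*(W/p) flattened patches of size p*p*C."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    gh, gw = h // p, w // p
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (N, p*p*C), patches in row-major order
    patches = img.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, p * p * c)


def build_tokens(x, calib, w_x, w_c, cls, p):
    """Build the encoder input sequence from one tactile image and its calibration set.

    x:     (H, W, C) tactile image
    calib: (K, H, W, C) calibration images for the same sensor
    w_x:   (p*p*C, D) linear projection for tactile patches (assumed shape)
    w_c:   (K*p*p*C, D) linear projection for stacked calibration patches (assumed shape)
    cls:   (D,) class token
    Returns a (1 + 2N, D) array: [class token, tactile tokens, calibration tokens].
    """
    x_tokens = patchify(x, p) @ w_x  # (N, D)
    # Assumed reading of region-wise stacking: group the K calibration patches
    # that share a spatial location, then project each group to one token.
    c_patches = np.stack([patchify(c, p) for c in calib], axis=1)  # (N, K, p*p*C)
    c_tokens = c_patches.reshape(c_patches.shape[0], -1) @ w_c     # (N, D)
    # Token order is an assumption; the caption only says the tokens are concatenated.
    return np.concatenate([cls[None, :], x_tokens, c_tokens], axis=0)


# Usage with hypothetical sizes: 32x32 RGB images, 8x8 patches, 64-dim tokens.
rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))
calib = rng.random((18, 32, 32, 3))          # 2 objects x 9 locations (Figure 3)
w_x = rng.normal(size=(8 * 8 * 3, 64))
w_c = rng.normal(size=(18 * 8 * 8 * 3, 64))
tokens = build_tokens(x, calib, w_x, w_c, np.zeros(64), p=8)
print(tokens.shape)
```

In the framework described by the caption, a sequence like this would go through a transformer encoder, with the output class token trained by supervised contrastive learning and the output patch tokens supervised by normal-map reconstruction. Neither the encoder nor the losses are shown here.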