VITaL Pretraining: Visuo-Tactile Pretraining for Tactile and Non-Tactile Manipulation Policies

Abraham George; Selam Gano; Pranav Katragadda; Amir Barati Farimani

TL;DR

This paper addresses how tactile information can be integrated into imitation learning for manipulation. It introduces a visuo-tactile pretraining scheme (VITaL pretraining) that aligns vision and tactile encoders via a temporally informed contrastive loss. The pretrained encoders are then used in two imitation-learning frameworks, ACT and Diffusion Policy, each in two modes: visuo-tactile and vision-only. Experiments on cable plugging and block stacking show that multimodal pretraining modestly benefits visuo-tactile policies while substantially boosting vision-only policies, which sometimes match or surpass tactile-enabled ones. The results highlight the value of tactile data for improving contact-awareness and reducing wear, suggest that task-specific tactile data can enhance non-tactile policies, and point to future work on large-scale visuo-tactile pretraining and transfer. Overall, the work demonstrates that visuo-tactile pretraining can extend the benefits of touch to vision-only manipulation policies, enabling high performance without deploying tactile sensors at inference.

Abstract

Tactile information is a critical tool for dexterous manipulation. As humans, we rely heavily on tactile information to understand objects in our environments and how to interact with them. We use touch not only to perform manipulation tasks but also to learn how to perform these tasks. Therefore, to create robotic agents that can learn to complete manipulation tasks at a human or super-human level of performance, we need to properly incorporate tactile information into both skill execution and skill learning. In this paper, we investigate how we can incorporate tactile information into imitation learning platforms to improve performance on manipulation tasks. We show that incorporating visuo-tactile pretraining improves imitation learning performance, not only for tactile agents (policies that use tactile information at inference), but also for non-tactile agents (policies that do not use tactile information at inference). For these non-tactile agents, pretraining with tactile information significantly improved performance (for example, improving the accuracy on USB plugging from 20% to 85%), reaching a level on par with visuo-tactile agents, and even surpassing them in some cases. For demonstration videos and access to our codebase, see the project website: https://sites.google.com/andrew.cmu.edu/visuo-tactile-pretraining

Paper Structure (17 sections, 2 equations, 7 figures, 1 algorithm)

Introduction
Related Works
Learning Control Policies using Tactile Sensors
Contrastive Pretraining
Imitation Learning for Robotic Manipulation
Action Chunking Transformers
Diffusion Policy
Methods
Pretraining
Imitation Learning Frameworks
Action Chunking Transformer
Diffusion Policy
Data Collection
Experimental Evaluation
Cable Plugging
...and 2 more sections

Figures (7)

Figure 1: Diagram of our approach. First, a vision encoder and a tactile encoder are pretrained on the collected demonstrations using a temporally informed multi-modal contrastive loss. Then, the pretrained encoders are used in an imitation learning framework, either for visuo-tactile control (left) or vision-only control (right).
Figure 2: Contrastive loss visualization. A series of visual observations $V_1, V_2, \ldots, V_N$ and tactile observations $T_1, T_2, \ldots, T_N$ are collected, and the vision encoder and tactile encoder are trained to make the embeddings from the same timestep similar while forcing apart the embeddings from different timesteps.
Figure 3: Imitation learning networks. ACT (left) is trained as an autoencoder, predicting a sequence of actions at each timestep ($a_t$). At inference, the latent variable, $z$, is set to 0. The network is queried each timestep, and all action predictions for that timestep are ensembled using a weighted average. Diffusion Policy (right) learns to predict noise applied to an action sequence. During inference, the action sequence is initialized with Gaussian noise and is iteratively denoised to produce output actions.
Figure 4: Experimental Setup. A GelSight captures tactile observations, while 6 RealSense cameras observe the scene (only two can be seen above; three are out of view and another is mounted to the back of the end-effector). Cable Plugging Task (left): The robot must retrieve the USB cable from its holder, plug it into the front port on the USB hub, and release it. Rectangle Block Stacking (top-right): The robot must pick up the blue block, stack it on top of the red block, and release it without knocking over the blocks. Cube Block Stacking (bottom-right): The robot must pick up the red block, stack it on the green block, and release it.
Figure 5: GelSight sensor outputs, showing the RGB images from the GelSight's camera and the processed strain data for both the covered and uncovered GelSight. The strain map is rendered in the LAB color space, with the brightness of each pixel corresponding to the normal strain (depth), and the color corresponding to the tangential strains, with strains in x shown on the blue-yellow spectrum, and strains in y shown on the red-green spectrum.
...and 2 more figures
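
The contrastive objective described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a symmetric, CLIP-style InfoNCE loss over one trajectory, where row i of each input is the embedding at timestep i. The function name `visuo_tactile_contrastive_loss`, the `temperature` value, and the use of cosine similarity are all assumptions made for illustration, and the sketch omits any additional temporal weighting the paper may use.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def visuo_tactile_contrastive_loss(vision_emb, tactile_emb, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss between per-timestep embeddings.

    vision_emb[i] and tactile_emb[i] come from the same timestep (positive pair);
    every other pairing (i, j != i) is treated as a negative.
    """
    v = [_normalize(e) for e in vision_emb]
    t = [_normalize(e) for e in tactile_emb]
    # logits[i][j] = cosine similarity of V_i and T_j, scaled by the temperature
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-tactile and tactile-to-vision directions
    return 0.5 * (
        _diagonal_cross_entropy(logits)
        + _diagonal_cross_entropy(logits_transposed)
    )
```

Under these assumptions, the loss is small when each visual embedding is closest to the tactile embedding from its own timestep, and large when the pairing is scrambled, which is the behavior the caption describes.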
