Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play

Irmak Guzey, Ben Evans, Soumith Chintala, Lerrel Pinto

TL;DR

T-Dex tackles dexterous manipulation with multi-fingered robot hands in two phases: it first learns self-supervised tactile representations from 2.5 hours of play data, then performs few-shot, non-parametric imitation that combines tactile and visual information. Across five contact-rich tasks, the tactile-based models outperform vision-only and torque-based baselines by an average of 1.7X, with gains growing as more diverse play data is used. Key contributions include (i) a tactile pretraining pipeline that applies BYOL to readings from touch sensors mounted on the robot hand, (ii) a nearest-neighbor imitation framework that combines tactile and visual features, and (iii) ablations examining the role of tactile representations, play data, and input architecture. The results show improved dexterous manipulation when the hand occludes the object and point toward data-efficient tactile-vision policies for real-world robots.
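The BYOL objective named in contribution (i) can be illustrated with a minimal sketch. It shows only the two ingredients that define BYOL in general: a regression loss between the L2-normalized online prediction and target projection, and an exponential-moving-average (EMA) update of the target weights. The function names, the plain-list vectors, and the value `tau=0.99` are assumptions made for illustration; the actual T-Dex encoders, augmentations, and hyperparameters are not described in this summary.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move each target weight a small step toward the matching online weight.

    tau is a hypothetical decay value chosen for illustration.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Two embeddings pointing in the same direction give zero loss.
    print(byol_loss([1.0, 0.0], [2.0, 0.0]))
    # The target network drifts slowly toward the online network.
    print(ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9))
```

In a full training loop the loss would be computed on two augmented views of the same tactile reading and gradients would flow only through the online network; the target network changes only through the EMA step.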

Abstract

Teaching dexterity to multi-fingered robots has been a longstanding challenge in robotics. Most prominent work in this area focuses on learning controllers or policies that operate either on visual observations or on state estimates derived from vision. However, such methods perform poorly on fine-grained manipulation tasks that require reasoning about contact forces or about objects occluded by the hand itself. In this work, we present T-Dex, a new approach for tactile-based dexterity that operates in two phases. In the first phase, we collect 2.5 hours of play data, which is used to train self-supervised tactile encoders. This is necessary to bring high-dimensional tactile readings to a lower-dimensional embedding. In the second phase, given a handful of demonstrations for a dexterous task, we learn non-parametric policies that combine the tactile observations with visual ones. Across five challenging dexterous tasks, we show that our tactile-based dexterity models outperform purely vision-based and torque-based models by an average of 1.7X. Finally, we provide a detailed analysis of factors critical to T-Dex, including the importance of play data, architectures, and representation learning.
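The second phase can be sketched as a nearest-neighbor lookup over demonstration frames. This is a minimal illustration, not the authors' implementation: the abstract states only that the policies are non-parametric and combine tactile with visual observations. The weighted sum of Euclidean distances, the function names, the weights, and the toy embeddings and action labels below are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_action(query_tactile, query_image, demos,
                            tactile_weight=1.0, image_weight=1.0):
    """Return the action stored with the demonstration frame closest to the query.

    demos is a list of (tactile_embedding, image_embedding, action) tuples.
    The distance is an assumed weighted sum of per-modality distances.
    """
    best_action, best_cost = None, math.inf
    for tactile, image, action in demos:
        cost = (tactile_weight * euclidean(query_tactile, tactile)
                + image_weight * euclidean(query_image, image))
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action


if __name__ == "__main__":
    # Hypothetical two-frame demonstration set with 2-D embeddings.
    demos = [
        ([0.0, 0.0], [0.0, 0.0], "open"),
        ([1.0, 1.0], [1.0, 1.0], "close"),
    ]
    print(nearest_neighbor_action([0.9, 1.0], [1.0, 0.8], demos))
```

Because the policy is a lookup rather than a trained network, adding a demonstration only means appending its frames to `demos`, which is one reason a handful of demonstrations can suffice.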

Paper Structure (40 sections, 21 figures, 3 tables)

This paper contains 40 sections, 21 figures, 3 tables.

Figures (21)

  • Figure 1: T-Dex learns dexterous policies from high-dimensional tactile sensors on a multi-fingered robot hand (top). Combined with vision, our tactile representations are crucial to learn fine-grained manipulation tasks (bottom).
  • Figure 2: Hardware setup of T-Dex. We use an Oculus headset to teleoperate the Allegro hand and the built-in Kinova joystick to control the arm. Visual observations are streamed through two different RealSense cameras and tactile observations are recorded with XELA touch sensors on the Allegro hand.
  • Figure 3: Visualization of some of the play tasks. We play with grasping, pinching, moving objects, and other in-hand manipulation tasks.
  • Figure 4: An overview of the T-Dex framework. Left: we train tactile representations using BYOL on a large play dataset. Right: we leverage the learned representations using nearest-neighbor imitation.
  • Figure 5: Visualization of robot rollouts from T-Dex policies. Note the severe visual occlusions when the robot makes contact with the object.
  • ...and 16 more figures