Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play
Irmak Guzey, Ben Evans, Soumith Chintala, Lerrel Pinto
TL;DR
T-Dex addresses dexterous manipulation with multi-fingered hands in two steps: it first learns self-supervised tactile representations from 2.5 hours of play data, then performs non-parametric imitation from a handful of demonstrations, fusing tactile and visual information. Tactile pretraining substantially improves performance over vision-only and torque-only baselines across five contact-rich tasks, with further gains as the amount of play data increases. Key contributions include (i) a tactile pretraining pipeline that applies BYOL to play data, (ii) a nearest-neighbor imitation framework that combines tactile and visual features, and (iii) ablations on the importance of tactile representations, play data, and input architecture. The results demonstrate practical improvements in dexterous manipulation under occlusion and point toward data-efficient tactile-vision policies for real-world robots.
Abstract
Teaching dexterity to multi-fingered robots has been a longstanding challenge in robotics. Most prominent work in this area focuses on learning controllers or policies that operate either on visual observations or on state estimates derived from vision. However, such methods perform poorly on fine-grained manipulation tasks that require reasoning about contact forces or about objects occluded by the hand itself. In this work, we present T-Dex, a new approach for tactile-based dexterity that operates in two phases. In the first phase, we collect 2.5 hours of play data, which is used to train self-supervised tactile encoders. This step is necessary to map high-dimensional tactile readings to a lower-dimensional embedding. In the second phase, given a handful of demonstrations for a dexterous task, we learn non-parametric policies that combine the tactile observations with visual ones. Across five challenging dexterous tasks, we show that our tactile-based dexterity models outperform purely vision-based and torque-based models by an average of 1.7X. Finally, we provide a detailed analysis of the factors critical to T-Dex, including the importance of play data, architectures, and representation learning.
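To make the second phase concrete, the following is a minimal sketch of a non-parametric (nearest-neighbor) policy over fused tactile and visual embeddings. It is an illustration under stated assumptions, not the T-Dex implementation: the function names `concat_features` and `nearest_neighbor_action`, the weighted concatenation, the Euclidean distance, the toy two-dimensional embeddings, and the string-valued actions are all hypothetical choices made here for clarity. In practice the embeddings would come from the pretrained tactile encoder and a visual encoder.

```python
import math


def concat_features(tactile, visual, tactile_weight=1.0, visual_weight=1.0):
    """Fuse two embeddings by weighted concatenation (an assumed fusion rule)."""
    return [tactile_weight * x for x in tactile] + [visual_weight * x for x in visual]


def nearest_neighbor_action(query, demo_features, demo_actions):
    """Return the action (and index) of the demonstration frame closest to the query.

    Distance is Euclidean here; the metric is an assumption of this sketch.
    """
    best_index = min(
        range(len(demo_features)),
        key=lambda i: math.dist(query, demo_features[i]),
    )
    return demo_actions[best_index], best_index


# Toy demonstration set: each frame stores a fused embedding and an action label.
demo_features = [
    concat_features([0.0, 0.0], [1.0, 0.0]),
    concat_features([1.0, 1.0], [0.0, 1.0]),
]
demo_actions = ["close_fingers", "rotate_wrist"]

# Current observation, already encoded into tactile and visual embeddings.
query = concat_features([0.9, 0.8], [0.1, 0.9])
action, index = nearest_neighbor_action(query, demo_features, demo_actions)
print(action)  # prints "rotate_wrist"
```

Because the policy only looks up stored demonstrations, no policy parameters are trained in this phase; the quality of the retrieved action depends entirely on how well the embeddings separate task-relevant states, which is where tactile pretraining matters.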
