ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Liang Heng, Haoran Geng, Kaifeng Zhang, Pieter Abbeel, Jitendra Malik

TL;DR

ViTacFormer presents a unified visuo-tactile framework that fuses vision and touch through cross-attention while forecasting future tactile signals via an autoregressive head. A curriculum-guided training strategy stabilizes cross-modal learning, enabling robust imitation learning for bi-manual dexterous manipulation. Empirical results on four short-horizon tasks and a long-horizon hamburger task show approximately 50% higher success rates than strong baselines and demonstrate the first real-robot completion of very long-horizon dexterous manipulation. The work advances real-world visuo-tactile robotics by delivering a scalable, generalizable cross-modal representation that supports precise, adaptive control.

Abstract

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.

Paper Structure

This paper contains 30 sections, 3 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: An overview of our system hardware and teleoperation setup. (a) Our hardware system setup. The hardware consists of two Realman robot arms, each equipped with a SharpaWave dexterous hand. Visual observations are obtained via two wrist cameras and one binocular camera. (b) Teleoperator with exoskeleton gloves and VR headset. (c) VR interface with binocular and wrist views, and tactile feedback overlay.
  • Figure 2: The neural network architecture for ViTacFormer is a conditional variational auto-encoder. Left: a transformer-based encoder maps action sequence and robot proprioception to action style variable $z$. Right: a transformer-based encoder-decoder uses style variable $z$, robot proprioception (joints), and visuo-tactile observations to auto-regressively predict future tactile signals and generate actions.
  • Figure 3: Cross-attention-based multimodal integration between visual and tactile observations.
  • Figure 4: Four short-horizon visuo-tactile tasks. From left to right: peg insertion, cap twist, vase wipe, and book flip.
  • Figure 5: A successful model rollout on the long-horizon task of making a hamburger, shown as keyframes across 11 stages. First row: the robot hand turns the brand to "open". Second row: the hand shovels the meat onto the bread. Third row: the hand assembles the hamburger. Fourth row: the hand transfers the hamburger to the plate. Fifth row: the hand turns the brand to "close".
  • ...and 7 more figures
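
The cross-attention fusion named in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and it assumes a bidirectional layout with a residual add, none of which the captions above specify. The names `cross_attend` and `fuse` and the toy token values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    Each query token from one modality is mapped to a weighted average of
    the other modality's tokens; the weights come from scaled dot products.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended


def fuse(visual_tokens, tactile_tokens):
    """Assumed bidirectional fusion: each modality attends to the other,
    and the result is added back to the original tokens (residual)."""
    vis_from_tac = cross_attend(visual_tokens, tactile_tokens)
    tac_from_vis = cross_attend(tactile_tokens, visual_tokens)
    fused_visual = [
        [a + b for a, b in zip(tok, upd)]
        for tok, upd in zip(visual_tokens, vis_from_tac)
    ]
    fused_tactile = [
        [a + b for a, b in zip(tok, upd)]
        for tok, upd in zip(tactile_tokens, tac_from_vis)
    ]
    return fused_visual, fused_tactile


# Toy inputs: three visual tokens and two tactile tokens, each of dimension 4.
visual = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
tactile = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
fused_visual, fused_tactile = fuse(visual, tactile)
```

Each modality keeps its own token count after fusion, so the fused tokens can be passed on to a downstream encoder-decoder. A practical model would add learned query, key, and value projections and multiple heads.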