3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing

Binghao Huang, Yixuan Wang, Xinyi Yang, Yiyue Luo, Yunzhu Li

TL;DR

3D-ViTac tackles the need for integrated visuo-tactile sensing in dexterous robotic manipulation. It introduces dense piezoresistive tactile sensors on soft grippers and a unified 3D visuo-tactile representation that preserves spatial relations between modalities, enabling learning with a diffusion-based policy. The approach yields improved performance over vision-only baselines, particularly under visual occlusion and in long-horizon in-hand tasks, demonstrated on real hardware with low-cost components. This work advances practical, contact-rich manipulation by combining scalable tactile sensing with explicit 3D fusion and diffusion-based imitation learning, and it provides a pathway toward robust, real-world dexterity.

Abstract

Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped with dense sensing units, each covering an area of $3\,\mathrm{mm}^2$. These sensors are low-cost and flexible, providing detailed and extensive coverage of physical contacts, effectively complementing visual information. To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves their 3D structures and spatial relationships. The multi-modal representation can then be coupled with diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies, particularly in safe interactions with fragile items and executing long-horizon tasks involving in-hand manipulation. Our project page is available at https://binghao-huang.github.io/3D-ViTac/.
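The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each tactile pad is flat, that its pose in the world frame is available as a 4x4 homogeneous transform (for example from forward kinematics), and that every point carries two extra channels: a tactile value and a flag marking tactile points. The function names `taxel_grid` and `fuse_visuo_tactile`, the taxel pitch, and the feature layout are all hypothetical.

```python
import numpy as np


def taxel_grid(rows=16, cols=16, pitch_m=0.002):
    """Local (x, y, 0) taxel positions on a flat pad, centered at the origin.

    pitch_m is a placeholder spacing, not a value taken from the paper.
    """
    ys, xs = np.meshgrid(
        (np.arange(rows) - (rows - 1) / 2) * pitch_m,
        (np.arange(cols) - (cols - 1) / 2) * pitch_m,
        indexing="ij",
    )
    return np.stack([xs.ravel(), ys.ravel(), np.zeros(rows * cols)], axis=1)


def fuse_visuo_tactile(camera_xyz, pad_readings, pad_poses, pitch_m=0.002):
    """Merge camera points and tactile points into one point cloud.

    camera_xyz   : (N, 3) points already expressed in the world frame.
    pad_readings : list of (rows, cols) tactile arrays, one per pad.
    pad_poses    : list of (4, 4) pad-to-world transforms (assumed known).
    Returns an array with columns x, y, z, tactile value, is_tactile flag.
    """
    # Visual points get zeros in both extra channels (assumed layout).
    blocks = [np.hstack([camera_xyz, np.zeros((len(camera_xyz), 2))])]
    for reading, pose in zip(pad_readings, pad_poses):
        rows, cols = reading.shape
        local = taxel_grid(rows, cols, pitch_m)
        # Rigid transform of each taxel from the pad frame to the world frame.
        world = local @ pose[:3, :3].T + pose[:3, 3]
        feats = np.column_stack([reading.ravel(), np.ones(rows * cols)])
        blocks.append(np.hstack([world, feats]))
    return np.vstack(blocks)
```

Because both modalities end up in the same coordinate frame, a downstream point-cloud encoder can see where contact forces sit relative to the observed geometry, which is the spatial relationship the abstract says the representation preserves.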

Paper Structure

This paper contains 31 sections, 2 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: We propose 3D-ViTac, a multi-modal sensing and learning system for dexterous bimanual manipulation. This system features flexible, scalable, low-cost tactile sensors, each finger equipped with a $16\times16$ sensor array. To demonstrate the capabilities of our system in performing precise manipulations, we showcase four tasks that utilize the force-related and in-hand position information provided by the tactile sensors.
  • Figure 2: Our Tactile Sensing Platform. Part (a) shows our bimanual tactile integrated system setup. We deploy four tactile sensor pads (two for each hand) on the soft grippers. The tactile readings are displayed on the back screen. Part (b) describes the design of our tactile-integrated soft gripper. Each sensor comprises 256 sensing units, with their locations on the gripper shown in (ii). We have also designed a readout board to collect tactile signals and forward them to the host computer. Part (c) shows the physical characteristics and sensing consistency of our tactile sensors (details in Sec. \ref{sec:exp}).
  • Figure 3: Visuo-Tactile Policy. Part (a) shows the real-world setup and the manipulated objects. Part (b) illustrates the processing of visual data (upper block) and tactile data (bottom block), followed by their integration within the same 3D coordinates. The visualization of the tactile signals shows that, depending on the relative movements of the two grippers, the force patterns on the two fingers of the same gripper can differ even when grasping a symmetric part of the tool. Such nuanced information is particularly important for in-hand object manipulation. Part (c) outlines our decision-making process, where our network takes the integrated 3D visuo-tactile representations as input and outputs the predicted action sequence.
  • Figure 4: Policy Rollout. We evaluate our visuo-tactile policy across four long-horizon, precise manipulation tasks. Detailed descriptions and metrics for these tasks can be found in Sec. \ref{sec: exp setup}. The first two rows emphasize tasks that require fine-grained force information, while the last two rows focus on tasks that require in-hand object state information. Please see the videos on our project page (https://binghao-huang.github.io/3D-ViTac/) for more details.
  • Figure 5: Tactile Feedback Improves the Demonstration Data Quality. Users new to the system perform better with both visual and tactile feedback than with vision alone.
  • ...and 9 more figures