
UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking

Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, Yizhe Wu, Rui Li, Xiaokang Yang, Ping Luo, Wei Sui, Yao Mu

TL;DR

UniVTAC introduces a unified, simulation-based framework for visuo-tactile manipulation that combines scalable data synthesis for three sensors, a tactile-centric encoder trained with multi-task supervisory signals, and an eight-task benchmark to evaluate tactile-driven policies. The UniVTAC Encoder learns representations that fuse shape, contact deformation, and pose information, enabling improved performance on tactile-dependent tasks and robust sim-to-real transfer. Empirical results show a 17.1 percentage-point average gain on the UniVTAC Benchmark and a 25% real-world improvement when deploying the encoder, validating the practicality of simulation-synthesized tactile data for real robotic manipulation. The framework offers a scalable foundation for future expansion to additional sensors, dynamic interactions, and open-world scenarios.

Abstract

Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation, as tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale and reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable and controllable generation of informative contact interactions. Based on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with designed supervisory signals, providing tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.

Paper Structure (22 sections, 5 equations, 7 figures, 5 tables)

This paper contains 22 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The UniVTAC Encoder Framework and Integration into Policy Learning. (a) The UniVTAC Encoder is pretrained with three self-supervised objectives (shape reconstruction, contact deformation prediction, and object pose regression) to learn a structured tactile-centric representation from raw visuo-tactile observations. (b) At deployment time, the pretrained encoder is integrated as a perception module for downstream manipulation policies, enabling end-to-end policy learning from raw tactile images without introducing additional inference-time overhead.
  • Figure 2: Reconstruction Result. From a tactile image with markers, the UniVTAC Encoder reconstructs complementary physical signals, including the marker-free tactile image, gelpad deformation depth map, and marker point positions, across diverse contact geometries. These results show that the learned representation captures both global shape cues and fine-grained contact deformation beyond sensor-specific visual patterns.
  • Figure 3: UniVTAC Benchmark Tasks. The UniVTAC Benchmark comprises eight representative visuo-tactile manipulation tasks spanning shape recognition, pose reasoning, and contact-rich interaction, and is designed to systematically evaluate tactile-dependent manipulation policies. For each task, we visualize two representative key frames corresponding to critical stages of execution. Each key frame includes both a visuo-tactile observation and a standard visual observation. For clarity of presentation, we display the tactile observation from only one side of the gripper, although tactile sensing is available on both fingertips during execution.
  • Figure 4: Impact of Pretraining Data Scale on Encoder Effectiveness. Downstream policy performance improves consistently with increasing amounts of synthetic tactile data, highlighting the benefits of large-scale simulated experiences for representation learning.
  • Figure 5: Real-world Task Key Frames. Representative key frames from three real-world visuo-tactile manipulation tasks, showing synchronized wrist RGB images (left) and marker-based tactile observations (right) at the initial approach, contact-rich interaction, and final completion stages. The intermediate frames highlight evolving contact states and deformation cues that support fine-grained alignment and correction beyond vision-only perception.
  • ...and 2 more figures
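
The Figure 1 caption names three pretraining objectives: shape reconstruction, contact deformation prediction, and object pose regression. The sketch below shows one plausible way such objectives could be combined into a single training loss. It is an illustration only: the use of mean squared error for every term, the weighted-sum combination, the weight values, and all names (`LossWeights`, `pretraining_loss`, the dictionary keys) are assumptions made here, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical per-objective weights (assumed, not from the paper)."""
    shape: float = 1.0
    deformation: float = 1.0
    pose: float = 1.0


def pretraining_loss(
    pred: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    weights: LossWeights = LossWeights(),
) -> Tuple[float, Dict[str, float]]:
    """Weighted sum of three regression losses.

    Expected keys (hypothetical names):
      "shape"       - flattened marker-free tactile image
      "deformation" - flattened gelpad deformation depth map
      "pose"        - object pose vector
    Returns the total loss and the unweighted per-objective terms.
    """
    terms = {
        "shape": mse(pred["shape"], target["shape"]),
        "deformation": mse(pred["deformation"], target["deformation"]),
        "pose": mse(pred["pose"], target["pose"]),
    }
    total = (
        weights.shape * terms["shape"]
        + weights.deformation * terms["deformation"]
        + weights.pose * terms["pose"]
    )
    return total, terms


if __name__ == "__main__":
    # Toy vectors standing in for decoder outputs and simulator labels.
    pred = {"shape": [0.0, 1.0], "deformation": [0.5, 0.5], "pose": [1.0, 2.0, 3.0]}
    target = {"shape": [0.0, 0.0], "deformation": [0.5, 0.5], "pose": [1.0, 2.0, 5.0]}
    total, terms = pretraining_loss(pred, target, LossWeights(pose=0.5))
    print(total, terms)
```

Returning the per-objective terms alongside the total makes it easy to monitor each supervisory signal separately during training. In a simulator-driven setup, the targets for all three terms would come from the simulator's ground-truth state, which is what makes this kind of dense supervision cheap to obtain at scale.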