UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking
Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, Yizhe Wu, Rui Li, Xiaokang Yang, Ping Luo, Wei Sui, Yao Mu
TL;DR
UniVTAC introduces a unified, simulation-based framework for visuo-tactile manipulation that combines scalable data synthesis for three commonly used visuo-tactile sensors, a tactile-centric encoder trained with multi-task supervisory signals, and an eight-task benchmark for evaluating tactile-driven policies. The UniVTAC Encoder learns representations that fuse shape, contact deformation, and pose information, improving performance on tactile-dependent tasks and supporting robust sim-to-real transfer. Empirical results show a 17.1 percentage-point average gain on the UniVTAC Benchmark and a 25% improvement in real-world task success when the encoder is deployed, supporting the practicality of simulation-synthesized tactile data for real robotic manipulation. The framework offers a scalable foundation for future expansion to additional sensors, dynamic interactions, and open-world scenarios.
Abstract
Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies, yet contact-rich manipulation still depends on visuo-tactile perception: tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale, reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable, controllable generation of informative contact interactions. Building on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with purpose-designed supervisory signals, which provides tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.
