VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

Zhaoliang Wan, Yonggen Ling, Senlin Yi, Lu Qi, Wangwei Lee, Minglei Lu, Sicheng Yang, Xiao Teng, Peng Lu, Xu Yang, Ming-Hsuan Yang, Hui Cheng

TL;DR

VinT-6D is the first extensive multi-modal dataset integrating vision, touch, and proprioception to enhance robotic manipulation. A benchmark method built on the dataset shows significant performance improvements from fusing multi-modal information.

Abstract

This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises two splits, VinT-Sim (2 million) and VinT-Real (0.1 million), collected respectively via simulations in MuJoCo and Blender and on a custom-designed real-world platform. The dataset is tailored for robotic hands, offering models with whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest such real-world split, which is notable given the difficulty of real-world collection, and it helps bridge the simulation-to-real gap compared with previous works. Built upon VinT-6D, we present a benchmark method that shows significant improvements in performance by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.

Paper Structure (44 sections, 4 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: The large-scale object-in-hand dataset VinT-6D comprises synthesized and real-world splits, named VinT-Sim and VinT-Real. VinT-Sim aims to generate realistic data across vision, touch, and proprioception. VinT-Real is gathered through a precisely calibrated and aligned multi-modal robot platform, where a motion capture system obtains accurate object and hand poses.
  • Figure 2: VinT-Sim Dataset Generation Pipeline. VinT-Sim takes as input a robotic hand, which can have either three or four fingers, and an object model. The process has three components: (1) Simulating whole-hand touch. (2) Generating tactile data and proprioception information through object-grasp interactions. (3) Rendering each object-grasp scene with various realistic backgrounds and capturing multiple views.
  • Figure 3: Simulated and Real-world Robotic Hands with Whole-Hand Tactile Perception. In VinT-6D, both the three-fingered Trx hand and the four-fingered Allegro hand are used to generate or collect datasets. These robotic hands are equipped with array-based tactile sensors covering the entire hand, with the simulated sensors distributed similarly to the real-world setup.
  • Figure 4: Visualization of full tactile points aligned with vision in VinT-Real. Gray points represent the point cloud from the depth camera, blue points depict the transformed model from the motion capture system, and red points indicate the full touch points of the hand.
  • Figure 5: Custom-Developed Robot Platform. The objects selected in VinT-Real are neatly placed on the desk for easy access and manipulation.
  • ...and 15 more figures
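
The Figure 4 caption describes an object model transformed by a motion-capture pose so that it overlays the depth point cloud. As a minimal sketch of that kind of alignment, the snippet below applies a rigid 6D pose (a rotation matrix plus a translation vector) to a set of model points with NumPy. The function names, the example pose, and the point values are hypothetical and chosen only for illustration; they are not taken from the VinT-6D code or data format.

```python
import numpy as np


def rotation_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def transform_points(points, rotation, translation):
    """Apply a rigid pose to an (N, 3) array of points: p' = R p + t."""
    points = np.asarray(points, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Row-vector form of R p + t for every point at once.
    return points @ rotation.T + translation


# Hypothetical model points and pose (illustrative values only).
model_points = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
pose_rotation = rotation_z(np.pi / 2)      # 90 degrees about z
pose_translation = np.array([0.0, 0.0, 0.5])

transformed = transform_points(model_points, pose_rotation, pose_translation)
print(transformed)
```

In a real pipeline the pose would come from the calibrated capture system, and the transformed model could then be compared against the camera point cloud to check alignment.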