ManiSkill-ViTac 2025: Challenge on Manipulation Skill Learning With Vision and Tactile Sensing

Chuanyu Li, Renjun Dang, Xiang Li, Zhiyuan Wu, Jing Xu, Hamidreza Kasaei, Roberto Calandra, Nathan Lepora, Shan Luo, Hao Su, Rui Chen

TL;DR

This paper introduces the ManiSkill-ViTac Challenge 2025, which focuses on learning contact-rich manipulation skills using both tactile and visual sensing. It comprises 3 independent tracks: tactile manipulation, tactile-vision fusion manipulation, and tactile sensor structure design.

Abstract

This article introduces the ManiSkill-ViTac Challenge 2025, which focuses on learning contact-rich manipulation skills using both tactile and visual sensing. Expanding upon the 2024 challenge, ManiSkill-ViTac 2025 includes 3 independent tracks: tactile manipulation, tactile-vision fusion manipulation, and tactile sensor structure design. The challenge aims to push the boundaries of robotic manipulation skills, emphasizing the integration of tactile and visual data to enhance performance in complex, real-world tasks. Participants will be evaluated using standardized metrics across both simulated and real-world environments, spurring innovations in sensor design and significantly advancing the field of vision-tactile fusion in robotics.

Paper Structure

This paper contains 51 sections, 16 equations, 6 figures, 1 table, and 3 algorithms.

Figures (6)

  • Figure 1: The left image shows the real-world experimental platform, consisting of a 3-axis translation stage, a rotary stage, two GelSight Mini sensors, an SRI M3813A 6DoF F/T Sensor, an Intel RealSense Depth Camera D415, and a parallel gripper (Robotiq Hand-E). The right image depicts the simulated scene, which only includes the silicone parts of the tactile sensors and the target objects for each task.
  • Figure 2: A simplified GelSight Mini model, where the edge points are fixed to apply displacement. After transforming the marker point into the camera frame, its pixel coordinates can be calculated using the pinhole model. The bottom right of the figure illustrates interpolation using adjacent FEM vertices: $\bm{p}_i$ represents the $i$th marker, and the facet it belongs to has 3 vertices, $\bm{x}_{s_{i1}}$, $\bm{x}_{s_{i2}}$, and $\bm{x}_{s_{i3}}$.
  • Figure 3: Frame and action direction in simulation.
  • Figure 4: The challenge workflow illustrates the progression through three independent tracks. Each track commences with Stage 1. Tracks 1 and 2 feature an additional Stage 2, where the number of teams advancing to this stage is determined by the total number of registrations. Track 3, on the other hand, proceeds directly from Stage 1 to the winner selection. At the conclusion of the challenge, winners from all tracks are recognized and awarded.
  • Figure 5: The diagram illustrates the actor networks for three tracks. In Tracks 1 and 3, marker flows obtained from two tactile sensors, along with the relative motion, are fed into the Encoder Network. The resulting features are concatenated and subsequently input into the MLP Policy Network to produce action outputs. In Track 2, inputs include the peg point cloud, hole point cloud, and marker flows from both left and right tactile sensors; these inputs are processed through the Encoder Network. The outputs from this network are concatenated and then input into the MLP Policy Network to generate action outputs.
  • ...and 1 more figures
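
The marker computation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration rather than the challenge's simulator code. It assumes that a marker is interpolated from the 3 vertices of its facet with fixed barycentric weights, and that the interpolated point is already expressed in the camera frame. The function names, the weights, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical values chosen for the example.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def interpolate_marker(vertices: Sequence[Vec3], weights: Sequence[float]) -> Vec3:
    """Interpolate a marker position from the 3 vertices of its facet.

    Assumption: the weights are barycentric (they sum to 1) and stay fixed
    while the FEM vertices move with the deforming silicone.
    """
    if len(vertices) != 3 or len(weights) != 3:
        raise ValueError("expected exactly 3 vertices and 3 weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("barycentric weights must sum to 1")
    x, y, z = (
        sum(w * v[axis] for w, v in zip(weights, vertices)) for axis in range(3)
    )
    return (x, y, z)


def project_pinhole(
    point_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float]:
    """Project a camera-frame point to pixel coordinates (pinhole model)."""
    x, y, z = point_cam
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical facet, 2 cm in front of the camera (units: metres).
facet = [(0.0, 0.0, 0.02), (0.002, 0.0, 0.02), (0.0, 0.002, 0.02)]
marker = interpolate_marker(facet, (0.5, 0.25, 0.25))
pixel = project_pinhole(marker, fx=400.0, fy=400.0, cx=160.0, cy=120.0)
print(marker, pixel)
```

Recomputing each marker from the current vertex positions and projecting it again after every simulation step would yield the marker displacement in image space, which is the kind of marker flow the Figure 5 caption lists as a policy input.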