dARt Vinci: Egocentric Data Collection for Surgical Robot Learning at Scale

Yihao Liu, Yu-Chun Ku, Jiaming Zhang, Hao Ding, Peter Kazanzides, Mehran Armand

TL;DR

The paper tackles data scarcity in robot-assisted minimally invasive surgery (RMIS) by introducing dARt Vinci, an egocentric AR data collection platform that uses a high-fidelity simulator to collect teleoperation demonstrations without a physical robot. It integrates AR hand tracking with a neural-inference-ready pipeline, mapping hand gestures to da Vinci Patient Side Manipulator (PSM) commands and recording compact JSON state data that can be replayed in IsaacSim. Ten primitive RMIS tasks are used to benchmark data collection efficiency. A user study shows 41% higher data throughput and 10% shorter experiment times on average, with collected data over 400 times smaller and sampled at double the frequency. The work enables scalable data collection for imitation and reinforcement learning in surgical robotics, lowering hardware barriers and broadening participation.

Abstract

Data scarcity has long been an issue in the robot learning community. In safety-critical domains such as surgical applications, obtaining high-quality data can be especially difficult. This poses challenges to researchers seeking to exploit recent advancements in reinforcement learning and imitation learning, which have greatly improved generalizability and enabled robots to conduct tasks autonomously. We introduce dARt Vinci, a scalable data collection platform for robot learning in surgical settings. The system uses Augmented Reality (AR) hand tracking and a high-fidelity physics engine to capture subtle maneuvers in primitive surgical tasks. By eliminating the need for a physical robot setup and providing flexibility in terms of time, space, and hardware resources (such as multiview sensors and actuators), specialized simulation is a viable alternative. At the same time, AR allows the robot data collection to be more egocentric, supported by its body tracking and content overlaying capabilities. Our user study, which uses primitive tasks widely used for training teleoperation with da Vinci surgical robots, confirms the proposed system's efficiency and usability. Data throughput improves across all tasks compared to real robot settings, by 41% on average. The total experiment time is reduced by an average of 10%. The temporal demand reported in the task load survey also improves. These gains are statistically significant. Additionally, the collected data is over 400 times smaller in size, requiring far less storage while achieving double the sampling frequency.

Paper Structure

This paper contains 12 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Workflow comparison between using the da Vinci Research Kit (dVRK) [kazanzides2014open] (top) and the proposed system (bottom) to collect surgical robot learning data. The real robot needs to be set up in the working environment. The preparation includes, among other steps, the calibration processes that obtain the transformations from the Endoscopic Camera Manipulator (ECM) to the Patient Side Manipulator (PSM). Data collection is then performed by participants sitting at the teleoperation console. In the proposed approach, the participant only needs a headset to track the hand gestures and a PC to run a high-fidelity simulator.
  • Figure 2: The primitive surgical tasks using dVRK. The center of the figure shows the complete view of the dVRK (robot base omitted) while executing peg transfer and needle passing tasks. Panels (a)-(j) demonstrate ten primitive surgical tasks commonly seen in benchmarking surgical robot learning and recognizing surgical gestures [ahmidi2017dataset, hwang2022automating, yu2024orbit]. They are (a) Reach, (b) Reach with Obstacles, (c) Dual Arm Reach, (d) Dual Arm Reach with Obstacles, (e) Suture Needle Lift, (f) Needle Handover, (g) Peg Block Lift, (h) Pick and Transfer, (i) Pick and Place, and (j) Needle Pass Ring. In each panel, from left to right, the views are the real robot view, the AR headset view with grippers overlaid on the hands, and the simulator view.
  • Figure 3: The architecture of the proposed data collection platform. There are three major components in addition to the robot to be used in the experiments: the AR headset visualizes the maneuver and tracks the hand gesture; the data server processes and stores the collected data and passes the end-effector pose to the physics engine; and the physics engine handles the simulation and returns the relevant states after each simulation frame update. The vision player replays the saved data to obtain the video data, which is then used for robot learning. The real robot can also be connected to the data server for data replaying.
  • Figure 4: The action mapping from the hand gesture to the PSM. The system follows the OpenXR Hand Skeleton convention. The distance between the tips of the index finger and the thumb is mapped to the opening and closing of the PSM gripper. The end-effector is anchored on the tip of the thumb, whose pose is used in the inverse kinematics solver to derive the joint angles of the PSM arm. The yaw joint of the gripper is used as the end of the manipulator.
  • Figure 5: Scaling of the scene entities. The left image is a photo illustrating the actual size of the hand, instruments, and peg board, whereas the right image shows the AR view, where virtual objects are scaled to 6x their true size so that subtle actions can be performed and the tolerance for error is increased.
  • ...and 2 more figures
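
The gesture mapping in Figure 4 and the scene scaling in Figure 5 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the linear pinch-to-jaw mapping, the maximum pinch distance, the maximum jaw angle, and the idea of dividing AR-scene positions by the scale factor are all assumptions made for illustration. Only the use of the thumb-to-index fingertip distance and the 6x scale come from the captions.

```python
import math

SCENE_SCALE = 6.0                  # from the Figure 5 caption: AR objects are 6x true size
MAX_PINCH_M = 0.10                 # hypothetical: pinch distance (m) treated as fully open
MAX_JAW_RAD = math.radians(60.0)   # hypothetical: jaw angle treated as fully open


def pinch_distance(thumb_tip, index_tip):
    """Euclidean distance between the thumb and index fingertip positions."""
    return math.dist(thumb_tip, index_tip)


def jaw_angle_from_pinch(thumb_tip, index_tip,
                         max_pinch=MAX_PINCH_M, max_jaw=MAX_JAW_RAD):
    """Map pinch distance linearly to a gripper jaw angle (assumed mapping)."""
    ratio = pinch_distance(thumb_tip, index_tip) / max_pinch
    ratio = min(max(ratio, 0.0), 1.0)  # clamp to [0, 1]
    return ratio * max_jaw


def to_true_scale(point, origin=(0.0, 0.0, 0.0), scale=SCENE_SCALE):
    """Shrink a position in the enlarged AR scene back toward true size,
    relative to a scene origin (assumed convention)."""
    return tuple(o + (p - o) / scale for p, o in zip(point, origin))


if __name__ == "__main__":
    thumb = (0.00, 0.00, 0.00)
    index = (0.05, 0.00, 0.00)
    print("jaw angle (rad):", jaw_angle_from_pinch(thumb, index))
    print("true-scale target:", to_true_scale((0.6, 0.0, -1.2)))
```

In a full pipeline, the thumb-tip pose (after any scale conversion) would be passed to an inverse kinematics solver to obtain the PSM joint angles, as the Figure 4 caption describes; that step is omitted here.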