World Models for General Surgical Grasping

Hongbin Lin, Bin Li, Chun Wai Wong, Juan Rojas, Xiangyu Chu, Kwok Wai Samuel Au

TL;DR

This work addresses the fragility of pose-estimation strategies in surgical grasping by proposing GAS, a world-model-based framework that learns a pixel-level visuomotor policy capable of grasping unseen surgical objects. GAS integrates video object segmentation, depth imputation for imprecise regions, and a compact Dynamic Spotlight Adaptation representation, augmented by Virtual Clutch, domain randomization, and FSM-driven rewards to enable robust sim-to-real transfer. On a real robot, GAS achieves an average success rate of 69%, generalizes to unseen objects, and is resilient to six disturbance types; in simulation, it reaches a success rate of 87%, significantly outperforming PPO and DreamerV2 baselines. The approach promises practical impact by enabling general, robust robotic grasping in complex surgery environments without extensive per-task hand-tuning.

Abstract

Intelligent vision control systems for surgical robots should adapt to unknown and diverse objects while being robust to system disturbances. Previous methods did not meet these requirements because they relied mainly on pose estimation and feature tracking. We propose a world-model-based deep reinforcement learning framework, "Grasp Anything for Surgery" (GAS), that learns a pixel-level visuomotor policy for surgical grasping, enhancing both generality and robustness. In particular, a novel method is proposed to estimate the values and uncertainties of depth pixels for a rigid-link object's inaccurate region based on the empirical prior of the object's size; both depth and mask images of task objects are encoded into a single compact 3-channel image (size: 64x64x3) by dynamically zooming in on the mask regions, minimizing information loss. The learned controller's effectiveness is extensively evaluated in simulation and on a real robot. Our learned visuomotor policy handles: i) unseen objects, including 5 types of target grasping objects and a robot gripper, in unstructured real-world surgery environments, and ii) disturbances in perception and control. Note that this is the first work to achieve a unified surgical control system that grasps diverse surgical objects using different robot grippers on real robots in complex surgery scenes (average success rate: 69%). Our system also demonstrates significant robustness across 6 conditions, including background variation, target disturbance, camera pose variation, kinematic control error, image noise, and re-grasping after the gripped target object drops from the gripper. Videos and code can be found on our project page: https://linhongbin.github.io/gas/.
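The depth-estimation idea in the abstract can be illustrated with a minimal NumPy sketch. It follows the description in the Figure 4 caption (the noisy region takes the median depth of the trusted region, and the uncertainty is a size bound on the object), but it is not the authors' implementation. The function name `impute_depth`, its parameters, and the choice to return the uncertainty as a per-pixel map are all assumptions made for illustration.

```python
import numpy as np


def impute_depth(depth, reliable_mask, noisy_mask, size_prior):
    """Fill untrusted depth pixels of an object from its trusted pixels.

    depth:         (H, W) float array of raw depth values
    reliable_mask: (H, W) bool array, object pixels whose depth is trusted
    noisy_mask:    (H, W) bool array, object pixels whose depth is not trusted
    size_prior:    assumed bound on the object's spatial extent, in the same
                   unit as depth (e.g. the sum of the link lengths)

    Returns (filled_depth, uncertainty); both are (H, W) float arrays.
    """
    if not reliable_mask.any():
        raise ValueError("need at least one trusted depth pixel")

    # Estimated value: median depth over the trusted region of the object.
    median_depth = float(np.median(depth[reliable_mask]))

    filled = depth.astype(float)  # astype returns a copy; the input is untouched
    filled[noisy_mask] = median_depth

    # Uncertainty: zero where depth is measured, the size bound where imputed.
    # Storing it as a per-pixel map is an assumption of this sketch.
    uncertainty = np.zeros(depth.shape, dtype=float)
    uncertainty[noisy_mask] = size_prior
    return filled, uncertainty
```

The median is used because the caption specifies it; it also keeps a few outlying pixels in the trusted region from shifting the imputed value.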

Paper Structure (32 sections, 6 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Generality, robustness, and evaluated environments of our visuomotor controller. (a) The success rates of our visuomotor controller for unseen objects, including 5 types of grasping objects and a robot gripper. (b) The controller's robustness across 6 conditions, including background variation, target disturbance, camera pose variation, kinematic control error, image noise, and re-grasping after the gripped target object drops from the gripper. (c) A robot gripper is actuated by the controller to grasp a needle on a rectangular phantom (bottom left) and a liver phantom (bottom right).
  • Figure 2: Diverse target objects and robot grippers in surgical grasping.
  • Figure 3: An overview of Grasp Anything for Surgery (GAS). Visual observations are processed by our proposed video processing, including video object segmentation, depth estimation, and Dynamic Spotlight Adaptation (DSA). A visuomotor policy, learned by world models, takes the processed observations as input. The predicted action of the policy, further processed by Virtual Clutch (VC), actuates the robot gripper to grasp target objects in simulation and on a real robot. Furthermore, domain randomization is applied in simulation for visuomotor learning.
  • Figure 4: Schematic illustration and real-world visualization of our uncertainty-aware depth estimation. (a) A schematic illustration of our uncertainty-aware depth estimation for a two-link robot arm on a 2D plane. The area of significant depth noise arises from the sensing principle of a structured-light depth camera. Depth in the noisy area is estimated as the median depth of the ground-truth area. The uncertainty is the minimal diameter of the spatial bound for an arbitrary configuration of the object (the sum of the lengths of the two links in this example). (b) The observed real-world RGB-D images and the mask for a gripper tip in general surgical grasping. The original depth in the gripper tip region is prone to inaccuracy (see top right). The depth pixels in the gripper tip region are calculated with our depth estimation (see bottom right).
  • Figure 5: Pipeline of Dynamic Spotlight Adaptation (DSA) for visual representation of world models in general surgical grasping. The visual masks, encoding matrices of mask IDs, and the estimated depth image are the inputs of our visual representation. The global layer (red channel) is obtained by generating square masks and a zoom-in mask based on the visual masks, followed by summing these masks with encoding values. The mask layer (green channel) and the depth layer (blue channel) are obtained by zooming in on a segmented mask image and a segmented depth image, respectively, each of which can be obtained by multiplication with the visual masks. We stack the three layers into a 3D matrix and downsample it to a compact 64x64x3 image for world model learning. Two scalar signals, i.e., the task-level state and the gripper toggling state, are encoded into two 6x6 square images, which replace pixels of the downsampled image at the bottom right corner of the global layer (red channel).
  • ...and 5 more figures
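
The DSA pipeline in the Figure 5 caption can be pictured with a simplified NumPy sketch. It is a rough illustration under stated assumptions, not the authors' code. The helper names, the padding, the nearest-neighbour resize, the `zoom_code` value, the clipping to [0, 1], and the side-by-side placement of the two 6x6 patches are all hypothetical. Only the 64x64x3 output size, the channel roles, and the 6x6 patch size come from the caption.

```python
import numpy as np

OUT = 64    # output side length (64x64x3 in the caption)
PATCH = 6   # side length of each scalar-state patch (6x6 in the caption)


def bounding_square(mask, pad=2):
    """Padded square around a non-empty boolean mask, clipped to the image."""
    rows, cols = np.nonzero(mask)
    h, w = mask.shape
    side = max(rows.max() - rows.min(), cols.max() - cols.min()) + 1 + 2 * pad
    side = int(min(side, h, w))
    cy = (rows.min() + rows.max()) // 2
    cx = (cols.min() + cols.max()) // 2
    top = int(np.clip(cy - side // 2, 0, h - side))
    left = int(np.clip(cx - side // 2, 0, w - side))
    return top, left, side


def resize_nearest(img, size=OUT):
    """Nearest-neighbour resize of a 2D array to size x size."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]


def dsa_encode(masks, codes, depth, task_state, gripper_state, zoom_code=0.2):
    """masks: list of (H, W) bool arrays; codes: one encoding value per mask;
    depth: (H, W) array scaled to [0, 1]; both states are scalars in [0, 1]."""
    union = np.any(masks, axis=0)
    top, left, side = bounding_square(union)
    zoom = (slice(top, top + side), slice(left, left + side))

    global_layer = np.zeros(depth.shape, dtype=float)
    mask_layer = np.zeros(depth.shape, dtype=float)
    global_layer[zoom] += zoom_code               # zoom-in mask (assumed value)
    for mask, code in zip(masks, codes):
        t, l, s = bounding_square(mask)
        global_layer[t:t + s, l:l + s] += code    # one square mask per object
        mask_layer[mask] = code                   # segmented mask image
    depth_layer = np.where(union, depth, 0.0)     # segmented depth image

    image = np.stack([
        resize_nearest(np.clip(global_layer, 0.0, 1.0)),  # red: full view
        resize_nearest(mask_layer[zoom]),                 # green: zoomed mask
        resize_nearest(depth_layer[zoom]),                # blue: zoomed depth
    ], axis=-1)

    # Two scalar signals as 6x6 patches in the bottom right corner of the
    # red channel; placing them side by side is an assumption of this sketch.
    image[-PATCH:, -PATCH:, 0] = task_state
    image[-PATCH:, -2 * PATCH:-PATCH, 0] = gripper_state
    return image
```

Because the green and blue channels are cropped to the zoom box before downsampling, small objects keep more pixels in the 64x64 image than they would if the full frame were resized directly.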