Learning Surgical Robotic Manipulation with 3D Spatial Priors

Yu Sheng, Lidian Wang, Xiaomeng Chu, Jiajun Deng, Min Cheng, Yanyong Zhang, Bei Hua, Houqiang Li, Jianmin Ji

TL;DR

This paper introduces the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that gives surgical robots 3D spatial awareness by directly exploiting the 3D spatial cues embedded in endoscopic images.

Abstract

Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscope. However, both paradigms have notable limitations: the former is prone to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice because wrist-mounted cameras can interfere with the motion of the surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that gives surgical robots 3D spatial awareness by directly exploiting the 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscopic images. These representations are then aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
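
To make the phrase "endoscope-centric coordinate frame" concrete, the following is a minimal sketch of how a relative translation action could be expressed in the endoscope frame rather than the robot base frame. It is an illustration under stated assumptions, not the authors' implementation: the function names, the pose convention (p_base = R p_cam + t), and the example numbers are all hypothetical, and only the translational part of an action is covered.

```python
# Hypothetical sketch: expressing tool-tip motion in the endoscope frame.
# Assumed convention: the endoscope pose in the base frame is (R, t),
# so that p_base = R @ p_cam + t. None of these names come from the paper.

def transpose(R):
    """Transpose a 3x3 matrix given as nested lists."""
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matvec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def base_to_endoscope(p_base, R, t):
    """Map a point from the base frame into the endoscope frame."""
    shifted = [p_base[i] - t[i] for i in range(3)]
    return matvec(transpose(R), shifted)

def relative_action_in_endoscope(p_now, p_next, R):
    """Relative translation between two tool-tip positions, expressed in
    the endoscope frame. The endoscope translation t cancels out."""
    delta = [p_next[i] - p_now[i] for i in range(3)]
    return matvec(transpose(R), delta)

# Example: endoscope rotated 90 degrees about the base z-axis (assumed values).
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
t = [0.1, 0.0, 0.2]

# A 2 cm move along the base y-axis becomes a move along the endoscope x-axis.
action = relative_action_in_endoscope([0.30, 0.10, 0.05],
                                      [0.30, 0.12, 0.05], R)
```

One reason such a convention is attractive is that the action is defined relative to what the camera sees, so the same image-space motion maps to the same action regardless of where the endoscope sits in the base frame.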

Paper Structure (15 sections, 4 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: We present Spatial Surgical Transformer (SST), a visuomotor policy that empowers surgical robots with spatial intelligence through learned 3D spatial priors. The policy leverages a geometry transformer finetuned on the proposed Surgical3D Dataset to extract robust 3D latent embeddings from stereo endoscopic inputs, coupled with a multi-level spatial feature connector that integrates multi-level 3D latent embeddings capturing both fine-grained details and global context into the policy decoder. We implement SST on a real surgical robot equipped with a stereo endoscopic camera manipulator (ECM) and two patient-side manipulators (PSMs), and evaluate it across three distinct real-world surgical scenes, where it achieves state-of-the-art performance.
  • Figure 2: Pipeline of Our Method. Top: The geometry transformer is first finetuned on the proposed Surgical3D dataset using a 3D reconstruction objective, enabling the extraction of robust 3D latent embeddings from endoscopic images. Bottom: The geometry transformer is then frozen, while the remaining components are trained to learn surgical manipulation policies with spatial priors from collected demonstrations. A multi-level spatial feature connector (MSFC) is trained to aggregate 3D latent embeddings from multiple geometry transformer blocks and align them with the robot’s action space. An endoscope-centric policy decoder generates relative robot actions in the endoscope frame, guided by the learned 3D spatial priors.
  • Figure 3: Left: Samples from our Surgical3D dataset. We utilize diverse 3D surgical assets to generate highly realistic and varied synthetic surgical scenes. Right: The figure illustrates the reconstruction results from MASt3R under three finetuning configurations. (a) One example of a stereo endoscopic image captured in a real in-vivo surgical scene, used as input. (b) The original MASt3R fails to reconstruct both the patient-side manipulator (PSM) and the organ surface. (c) When finetuned solely on synthetic data, the organ reconstruction remains coarse and the PSM geometry is still incomplete. (d) When finetuned on a combination of synthetic and real data, the model achieves more accurate reconstruction of PSM instruments and better organ completeness.
  • Figure 4: Visualization of Experimental Settings. Yellow and blue arrows denote the approximate motion directions of the right and left arms, respectively. Top: Peg pickup task. A total of 180 trajectories were collected, with roughly 120 in the green region and 60 in the blue region for training. Test1 and Test2 correspond to evaluations in these respective areas. Middle: Knot tying task. The suture tail was randomly positioned within the blue region during data collection, and evaluations were performed under the same condition. (b1) and (b3) show the grasping and looping actions. Bottom: Ex-vivo gallbladder dissection task. Grasp points on the gallbladder were sampled from varying positions within the blue region during data collection, and all methods were evaluated under the same setup. (c1) and (c3) depict the grasping and cutting actions. Supplementary videos provide detailed views of the task execution.
  • Figure 5: Qualitative Results of Intermediate 3D Reconstruction. Subscripts at the lower left of each image indicate the manipulation step. Tokens from the finetuned geometry transformer are fed into DPT heads to generate intermediate 3D reconstructions. The results demonstrate that the geometry transformer effectively extracts 3D cues from current endoscopic observations across various test tasks without task-specific training. More details are provided in the supplementary videos.
  • ...and 1 more figure
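
The Figure 2 caption says the MSFC aggregates 3D latent embeddings from multiple geometry transformer blocks and aligns them with the action space, but the text here does not specify how. The sketch below shows one plausible reading, purely as an assumption: a softmax-weighted sum over per-level feature vectors followed by a linear projection to the decoder width. The names `fuse_levels` and `project`, the weighting scheme, and all numbers are hypothetical and should not be taken as the authors' design.

```python
# Hypothetical sketch of multi-level feature aggregation.
# Assumption: each transformer level yields a feature vector of equal length,
# and levels are mixed with softmax weights (learned in practice, fixed here).
from math import exp

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_levels(level_feats, level_logits):
    """Weighted sum of per-level feature vectors."""
    weights = softmax(level_logits)
    dim = len(level_feats[0])
    return [
        sum(weights[l] * level_feats[l][i] for l in range(len(level_feats)))
        for i in range(dim)
    ]

def project(feat, W, b):
    """Linear projection of the fused feature to the decoder input width."""
    return [
        sum(W[r][c] * feat[c] for c in range(len(feat))) + b[r]
        for r in range(len(W))
    ]

# Example with two levels and equal logits, so fusion reduces to a mean.
shallow = [1.0, 2.0]   # stands in for fine-grained detail
deep = [3.0, 4.0]      # stands in for global context
fused = fuse_levels([shallow, deep], [0.0, 0.0])

# Project from width 2 to an assumed decoder width of 3.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]
decoder_input = project(fused, W, b)
```

Mixing shallow and deep levels is a common way to expose both fine-grained detail and global context to a downstream decoder, which matches the motivation given in the Figure 1 caption.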