SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks

Masoud Moghani, Nigel Nelson, Mohamed Ghanem, Andres Diaz-Pinto, Kush Hari, Mahdi Azizian, Ken Goldberg, Sean Huver, Animesh Garg

TL;DR

SuFIA-BC tackles the challenge of learning accurate visuomotor policies for fine-grained, contact-rich surgical subtasks by leveraging a photorealistic surgical digital twin to generate synthetic data and evaluate behavior cloning backbones. The approach extends Orbit-Surgical with high-fidelity organ models and a teleoperation dataset to study perception strategies (RGB-D and point clouds) and policy backbones (ACT and diffusion). Key findings show that RGB-based perception with multiple cameras provides strong semantic grounding and generalization, while point-cloud perception offers greater viewpoint robustness but can struggle with object differentiation and geometry-specific generalization; diffusion and transformer-based action chunking show complementary strengths in handling temporal structure. The work advances data-efficient surgical autonomy research by highlighting the need for tailored perception pipelines and larger synthetic datasets, and it provides open-source data for further study.

Abstract

Behavior cloning facilitates the learning of dexterous manipulation skills, yet the complexity of surgical environments, the difficulty and expense of obtaining patient data, and robot calibration errors present unique challenges for surgical robot learning. We provide an enhanced surgical digital twin with photorealistic human anatomical organs, integrated into a comprehensive simulator designed to generate high-quality synthetic data to solve fundamental tasks in surgical autonomy. We present SuFIA-BC: visual Behavior Cloning policies for Surgical First Interactive Autonomy Assistants. We investigate visual observation spaces including multi-view cameras and 3D visual representations extracted from a single endoscopic camera view. Through systematic evaluation, we find that the diverse set of photorealistic surgical tasks introduced in this work enables a comprehensive evaluation of prospective behavior cloning models for the unique challenges posed by surgical environments. We observe that current state-of-the-art behavior cloning techniques struggle to solve the contact-rich and complex tasks evaluated in this work, regardless of their underlying perception or control architectures. These findings highlight the importance of customizing perception pipelines and control architectures, as well as curating larger-scale synthetic datasets that meet the specific demands of surgical tasks. Project website: https://orbit-surgical.github.io/sufia-bc/

Paper Structure

This paper contains 22 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview: (a) Photorealistic human anatomical organs and textures in Orbit-Surgical. (b-f) Visuomotor behavior cloning policies executing fine-grained robotic maneuvers performed during surgery and the hands-on training exercises used in tabletop surgical curricula: (b) Tissue Retraction, (c) Needle Lift, (d) Needle Handover, (e) Suture Pad, (f) Block Transfer.
  • Figure 2: Surgical digital twin: This workflow illustrates the full pipeline for creating photorealistic anatomical models, from raw CT volume data to final OpenUSD in NVIDIA Omniverse. The process includes organ segmentation, mesh conversion, mesh cleaning and refinement, and photorealistic texturing, culminating in the assembly of all textured organs into a unified OpenUSD file.
  • Figure 3: Visual observations: An example of visual observations for the needle handover; (a) a static top-down endoscope view, (b) downsampled point clouds capturing the arms and the suture needle (the background is shown only for demonstration purposes and is not used during training for point cloud-based policies), (c-d) wrist camera views from each arm.
  • Figure 4: Sample efficiency in simulation: Each task's 50 demonstrations are subsampled in increments of 10 to evaluate the models' sample efficiency. Success rates are calculated over 20 trial runs at test time.
  • Figure 5: Viewpoint robustness: Performance of models trained on the primary camera views (train) is evaluated against small camera perturbations (view 1) and major viewpoint changes (view 2). Success rates are calculated over 20 trial runs at test time. ACT - S, ACT - M, and ACT - PC denote ACT - Single Camera, ACT - Multi Camera, and ACT - Point Cloud, respectively.
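
The evaluation protocol described in the Figure 4 caption (50 demonstrations subsampled in increments of 10, success rate over 20 test-time trials) can be sketched roughly as follows. This is only an illustrative outline, not the authors' code: the `train` and `rollout` callables are hypothetical placeholders for a policy trainer and a simulated trial, and drawing each subset at random with a fixed seed is an assumption, since the caption does not say how the subsets are chosen.

```python
import random


def subsample_sizes(total=50, step=10):
    """Demonstration budgets to evaluate: step, 2*step, ..., total."""
    return list(range(step, total + 1, step))


def success_rate(outcomes):
    """Fraction of successful trials in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def sample_efficiency(demos, train, rollout, trials=20, step=10, seed=0):
    """Map each demonstration budget to a success rate.

    train(subset) -> policy and rollout(policy, trial_index) -> bool are
    hypothetical callables supplied by the caller (assumptions, not a
    real library API).
    """
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    results = {}
    for n in subsample_sizes(len(demos), step):
        subset = rng.sample(demos, n)  # assumed: random subset of size n
        policy = train(subset)
        outcomes = [rollout(policy, t) for t in range(trials)]
        results[n] = success_rate(outcomes)
    return results
```

With 20 trials per setting, each reported success rate moves in steps of 0.05, so small differences between models in Figures 4 and 5 correspond to only one or two trials.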