Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models

Hao Ding; Lalithkumar Seenivasan; Hongchao Shu; Grayson Byrd; Han Zhang; Pu Xiao; Juan Antonio Barragan; Russell H. Taylor; Peter Kazanzides; Mathias Unberath

TL;DR

This work proposes an alternative perception approach: digital twin-based machine perception that builds on the strong performance and out-of-the-box generalization of recent vision foundation models. The resulting embodied intelligence system shows strong task performance and generalizes to varied environment settings.

Abstract

Large language model (LLM)-based agents are emerging as a powerful enabler of robust embodied intelligence due to their capability of planning complex action sequences. Sound planning ability is necessary for robust automation in many task domains, and especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, similarly powerful and robust perception algorithms are needed to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions that meet the needs of bench-top experiments but lack the flexibility required to scale to less constrained settings. In this work, we propose an alternative perception approach: a digital twin-based machine perception approach that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our digital twin-based scene representation and an LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizability to varied environment settings. Despite convincing performance, this work is merely a first step towards the integration of digital twin-based scene representations. Future studies are necessary for the realization of a comprehensive digital twin framework to improve the interpretability and generalizability of embodied intelligence in surgery.

Paper Structure (25 sections, 3 figures, 2 tables)

Introduction
Related Work
Machine Perception in Surgical Automation
Foundation Models for Perception
Language-based Automation
Method
Preliminaries
Segment Anything Model 2 (SAM2)
FoundationPose
Embodied Surgical System Overview
Digital Twin-based Machine Perception
Digital twin-based scene representation
Perception Workflow
Robotic Control System
Embodied intelligence
...and 10 more sections

Figures (3)

Figure 1: Illustration of the digital twin-based embodied surgical system. A machine perception module extracts a digital twin-based scene representation from the physical environment. An LLM-enabled embodied intelligence takes commands from a supervisor and makes high-level task plans based on the scene representation, prior knowledge, available actions, and previous actions and feedback. A robotic system receives commands from the embodied intelligence and executes them in the physical world. This embodied surgical system is implemented to automate peg transfer and gauze retrieval.
Figure 2: Illustration of the workflow of the proposed embodied surgical system with digital twin-based machine perception. The captured image is first processed by SAM2 (Ravi et al., 2024) with initial point prompts for the objects of interest. The object identities, segmentation masks, raw image, and corresponding 3D models are then passed to the FoundationPose model to predict 6DoF poses. The extracted information forms a digital twin-based scene representation, which the embodied intelligence uses for task planning.
Figure 3: Illustration of physical setup and varied experimental environment.
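
The data flow described in the captions for Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each object's 6DoF pose is already available (for example, from a pose estimator run on segmented objects), and it only shows how such poses might be collected into a scene representation and rendered as text for an LLM planner. All class names, field names, object names, and action names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectState:
    """One tracked object: identity plus a 6DoF pose (hypothetical layout)."""
    name: str
    position_m: Tuple[float, float, float]        # translation, metres
    rotation_rpy_deg: Tuple[float, float, float]  # roll, pitch, yaw, degrees


@dataclass
class SceneRepresentation:
    """Illustrative snapshot of the scene; not the paper's data format."""
    objects: List[ObjectState] = field(default_factory=list)

    def describe(self) -> str:
        """Render the scene as plain text that a language model could read."""
        lines = []
        for obj in self.objects:
            x, y, z = obj.position_m
            roll, pitch, yaw = obj.rotation_rpy_deg
            lines.append(
                f"- {obj.name}: position=({x:.3f}, {y:.3f}, {z:.3f}) m, "
                f"orientation=(roll={roll:.1f}, pitch={pitch:.1f}, yaw={yaw:.1f}) deg"
            )
        return "\n".join(lines)


def build_planner_prompt(
    command: str,
    scene: SceneRepresentation,
    prior_knowledge: str,
    available_actions: List[str],
    history: List[str],
) -> str:
    """Assemble the planner inputs named in the Figure 1 caption into one prompt."""
    sections = [
        ("Supervisor command", command),
        ("Scene representation", scene.describe() or "(no objects detected)"),
        ("Prior knowledge", prior_knowledge),
        ("Available actions", "\n".join(f"- {a}" for a in available_actions)),
        ("Previous actions and feedback", "\n".join(history) or "(none)"),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Example values are made up for illustration only.
scene = SceneRepresentation([
    ObjectState("peg_1", (0.012, -0.034, 0.150), (0.0, 0.0, 90.0)),
    ObjectState("gauze", (0.050, 0.020, 0.148), (0.0, 0.0, 0.0)),
])
prompt = build_planner_prompt(
    command="Transfer peg_1 to the opposite side of the board.",
    scene=scene,
    prior_knowledge="Pegs must be grasped from above.",
    available_actions=["pick(object)", "place(object, target)"],
    history=[],
)
print(prompt)
```

Keeping the scene representation as structured data and converting it to text only at the planner boundary means the same state could also feed non-language consumers, such as a motion controller, without re-parsing.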
