Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models

Hao Ding, Lalithkumar Seenivasan, Hongchao Shu, Grayson Byrd, Han Zhang, Pu Xiao, Juan Antonio Barragan, Russell H. Taylor, Peter Kazanzides, Mathias Unberath

TL;DR

This work proposes an alternative perception approach for surgical automation: digital twin-based machine perception that builds on the convincing performance and out-of-the-box generalization of recent vision foundation models. The resulting embodied intelligence system shows strong task performance and generalizes to varied environment settings.

Abstract

Large language model (LLM)-based agents are emerging as a powerful enabler of robust embodied intelligence because they can plan complex action sequences. Sound planning ability is necessary for robust automation in many task domains, and especially in surgical automation. These agents rely on a highly detailed natural language representation of the scene. Thus, to leverage the emergent capabilities of LLM agents for surgical task planning, similarly powerful and robust perception algorithms are needed to derive a detailed scene representation of the environment from visual input. Previous research has focused primarily on enabling LLM-based task planning while adopting simple yet severely limited perception solutions that meet the needs of bench-top experiments but lack the flexibility to scale to less constrained settings. In this work, we propose an alternative perception approach: digital twin-based machine perception that capitalizes on the convincing performance and out-of-the-box generalization of recent vision foundation models. Integrating our digital twin-based scene representation and an LLM agent for planning with the dVRK platform, we develop an embodied intelligence system and evaluate its robustness in performing peg transfer and gauze retrieval tasks. Our approach shows strong task performance and generalizes to varied environment settings. Despite this convincing performance, the work is only a first step towards the integration of digital twin-based scene representations. Future studies are necessary to realize a comprehensive digital twin framework that improves the interpretability and generalizability of embodied intelligence in surgery.

Paper Structure

This paper contains 25 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Illustration of the digital twin-based embodied surgical system. A machine perception module extracts a digital twin-based scene representation from the physical environment. An LLM-enabled embodied intelligence takes commands from a supervisor and makes high-level task plans based on the scene representation, prior knowledge, available actions, and previous actions and feedback. A robotic system receives commands from the embodied intelligence and executes them in the physical world. This embodied surgical system is implemented to automate peg transfer and gauze retrieval.
  • Figure 2: Illustration of the workflow of the proposed embodied surgical system with digital twin-based machine perception. The captured image is first processed by SAM2 (Ravi et al., 2024) with initial point prompts for the objects of interest. Each object's identity and segmentation, the raw image, and the corresponding 3D model are then passed to the FoundationPose model to predict 6DoF poses. The extracted information forms a digital twin-based scene representation, which the embodied intelligence uses for task planning.
  • Figure 3: Illustration of the physical setup and the varied experimental environments.
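
The hand-off from perception to planning described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the object identities and 6DoF poses have already been estimated upstream (for example by a segmentation model and a pose estimator), and it does not call SAM2, FoundationPose, an LLM, or the dVRK. The names `ObjectState`, `describe_scene`, and `build_prompt`, the prompt layout, the coordinate frame, and the choice to summarize orientation by yaw alone are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ObjectState:
    """One entry of a hypothetical digital twin: identity plus 6DoF pose."""
    name: str
    position_m: Sequence[float]          # (x, y, z) in metres; frame is an assumption
    rotation: Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def yaw_deg(rotation: Sequence[Sequence[float]]) -> float:
    """Rotation about the z axis in degrees (ZYX Euler convention)."""
    return math.degrees(math.atan2(rotation[1][0], rotation[0][0]))


def describe_scene(objects: List[ObjectState]) -> str:
    """Render object poses as a plain-text scene description, sorted by name."""
    lines = []
    for obj in sorted(objects, key=lambda o: o.name):
        x, y, z = obj.position_m
        lines.append(
            f"- {obj.name}: position (x={x:.3f}, y={y:.3f}, z={z:.3f}) m, "
            f"yaw {yaw_deg(obj.rotation):.1f} deg"
        )
    return "\n".join(lines)


def build_prompt(command: str, objects: List[ObjectState], prior_knowledge: str,
                 actions: List[str], history: List[str]) -> str:
    """Assemble the planner inputs named in the Figure 1 caption into one prompt."""
    history_text = "\n".join(f"- {h}" for h in history) if history else "- none"
    return "\n".join([
        "Supervisor command: " + command,
        "Scene representation:",
        describe_scene(objects),
        "Prior knowledge: " + prior_knowledge,
        "Available actions: " + ", ".join(actions),
        "Previous actions and feedback:",
        history_text,
    ])


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the paper.
    scene = [
        ObjectState("peg_1", (0.010, -0.020, 0.150), [[0, -1, 0], [1, 0, 0], [0, 0, 1]]),
        ObjectState("block_red", (0.0, 0.0, 0.100), [[1, 0, 0], [0, 1, 0], [0, 0, 1]]),
    ]
    print(build_prompt("Transfer the red block to peg 1", scene,
                       "Blocks must be grasped before they are moved.",
                       ["move_to", "grasp", "release"], []))
```

Keeping the scene description as structured text derived from explicit poses means that the planner input can be inspected and checked independently of the perception models that produced the poses.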