From Decision to Action in Surgical Autonomy: Multi-Modal Large Language Models for Robot-Assisted Blood Suction

Sadra Zargarzadeh, Maryam Mirzaei, Yafei Ou, Mahdi Tavakoli

TL;DR

Autonomous decision-making in robot-assisted surgery faces safety, explainability, and adaptability challenges. The authors propose a two-level framework: a multi-modal LLM handles high-level reasoning and planning, and a deep reinforcement learning (DRL) controller handles low-level motion. The framework is tested in dynamic blood-flow scenarios to assess robustness. In simulation, LLM reasoning with context-based prompt augmentation (LRWC) improves the speed and consistency of suction decisions and aligns better with human decision-making, while leveraging visual cues such as tool proximity. This distributed-agency framework advances autonomous surgical subtask capabilities and informs pathways toward clinical validation, safety integration, and operator interfaces.

Abstract

The rise of Large Language Models (LLMs) has impacted research in robotics and automation. While progress has been made in integrating LLMs into general robotics tasks, a noticeable void persists in their adoption in more specific domains such as surgery, where critical factors such as reasoning, explainability, and safety are paramount. Achieving autonomy in robotic surgery, which entails the ability to reason and adapt to changes in the environment, remains a significant challenge. In this work, we propose a multi-modal LLM integration in robot-assisted surgery for autonomous blood suction. The reasoning and prioritization are delegated to the higher-level task-planning LLM, and the motion planning and execution are handled by the lower-level deep reinforcement learning model, creating a distributed agency between the two components. As surgical operations are highly dynamic and may encounter unforeseen circumstances, blood clots and active bleeding were introduced to influence decision-making. Results showed that using a multi-modal LLM as a higher-level reasoning unit can account for these surgical complexities to achieve a level of reasoning previously unattainable in robot-assisted surgeries. These findings demonstrate the potential of multi-modal LLMs to significantly enhance contextual understanding and decision-making in robotic-assisted surgeries, marking a step toward autonomous surgical systems.

Paper Structure (15 sections, 3 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: The high-level task reasoning and planning for the blood suction task is performed by the LLM, and the low-level motion planning and execution is done by the DRL agent.
  • Figure 2: System architecture.
  • Figure 3: An example of LLM reasoning with (LRWC) and without (LRWOC) context-based prompt augmentation. The guideline provided to the LLM as context is as follows: Address active bleeding first, consider pool size next, and address the blood clot pool last, as coagulation ensures that flow in this pool has ceased and will not propagate further.
  • Figure 4: Simulation Environment 1. The LLM reasoning prioritizes suctioning the pools based on their size in the absence of surgical complexities such as active bleeding and blood clots as seen in (a)-(f).
  • Figure 5: Progression in blood suction in the four environments.
  • ...and 2 more figures
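
The prioritization guideline quoted in the Figure 3 caption (active bleeding first, then pool size, blood clot pool last) can be illustrated with a small deterministic sketch. This is not the paper's method: the paper delegates this reasoning to a multi-modal LLM, whereas the code below only encodes the guideline text as a sort key. The `BloodPool` class, its fields, the `suction_order` function, the "larger pool first" ordering within a tier, and the choice to let active bleeding override a clot flag are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BloodPool:
    """Hypothetical description of one blood pool in the scene."""
    name: str
    area: float                    # assumed size measure (arbitrary units)
    active_bleeding: bool = False  # pool is still being fed by a bleed
    clotted: bool = False          # pool has coagulated; flow has ceased


def suction_order(pools: List[BloodPool]) -> List[BloodPool]:
    """Order pools following the guideline quoted in the Figure 3 caption.

    Tier 0: actively bleeding pools.
    Tier 1: ordinary pools.
    Tier 2: clotted pools (addressed last).
    Within a tier, larger pools come first (an assumption).
    """
    def key(pool: BloodPool):
        if pool.active_bleeding:   # assumed to override the clot flag
            tier = 0
        elif pool.clotted:
            tier = 2
        else:
            tier = 1
        return (tier, -pool.area)

    return sorted(pools, key=key)


if __name__ == "__main__":
    scene = [
        BloodPool("A", area=3.0),
        BloodPool("B", area=5.0, clotted=True),
        BloodPool("C", area=1.0, active_bleeding=True),
        BloodPool("D", area=4.0),
    ]
    print([p.name for p in suction_order(scene)])  # prints ['C', 'D', 'A', 'B']
```

A fixed rule like this cannot use visual context such as tool proximity, which the TL;DR lists among the cues the LLM leverages; it is meant only to make the stated priority order concrete.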