SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence

Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhitao Zeng, Zhu Zhuo, Evangelos B. Mazomenos, Yueming Jin

TL;DR

The paper tackles unreliable surgical scene interpretation due to hallucinations and domain gaps in Vision-Language Models. It introduces SurgRAW, a hierarchical, multi-agent framework where task-specific Chain-of-Thought prompts guide CoT-embedded VLM agents across five tasks, complemented by Retrieval-Augmented Generation and a panel-discussion mechanism for grounding and consistency. A new SurgCoTBench dataset provides frame-level, reasoning-focused evaluation across robotic procedures. Experiments show substantial accuracy improvements, outperforming baselines and establishing state-of-the-art performance for explainable, autonomous surgical assistance.

Abstract

Integration of Vision-Language Models (VLMs) in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies within surgical scenes, undermining clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task-awareness required for precise surgical scene interpretation. Although Chain-of-Thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By employing specialized CoT prompts across five tasks (instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment), SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-Augmented Generation (RAG) is also integrated to incorporate external medical knowledge, bridging domain gaps and improving response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, with a panel discussion mechanism promoting logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. With comprehensive experiments, we demonstrate the effectiveness of the proposed SurgRAW, with a 29.32% accuracy improvement over baseline VLMs on 12 robotic procedures, achieving state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.
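
As a rough illustration of the workflow the abstract describes, the following Python sketch routes a query to one of the five task agents, wraps the query in a fixed (not self-generated) CoT prompt, and reconciles several answers. It is not the authors' implementation: only the five task names come from the abstract. The keyword cues in TASK_CUES, the function names route, make_agent, and panel_discussion, the stubbed VLM callable, and the use of a majority vote as a stand-in for the panel discussion are all assumptions made for illustration. Retrieval is left out entirely.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Task names are taken from the abstract; the cue words are hypothetical.
TASK_CUES: Dict[str, List[str]] = {
    "instrument recognition": ["instrument", "tool"],
    "action recognition": ["doing", "performing"],
    "action prediction": ["next"],
    "patient data extraction": ["patient"],
    "outcome assessment": ["outcome"],
}


@dataclass
class AgentReply:
    task: str
    steps: List[str]  # the fixed CoT steps the agent was prompted with
    answer: str


def route(query: str) -> str:
    """Hypothetical orchestrator: pick the first task whose cue appears."""
    q = query.lower()
    for task, cues in TASK_CUES.items():
        if any(cue in q for cue in cues):
            return task
    return "instrument recognition"  # arbitrary fallback for this sketch


def make_agent(task: str, steps: List[str],
               vlm: Callable[[str], str]) -> Callable[[str], AgentReply]:
    """Build an agent that sends a task-specific CoT prompt to `vlm`."""
    def agent(query: str) -> AgentReply:
        numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
        prompt = f"Task: {task}\n{numbered}\nQuestion: {query}"
        return AgentReply(task=task, steps=steps, answer=vlm(prompt))
    return agent


def panel_discussion(answers: List[str]) -> str:
    """Stand-in for the panel discussion: return the most common answer."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # A lambda replaces the real VLM call; no model is queried here.
    agent = make_agent(
        "action prediction",
        ["Identify the instruments in view", "Identify the current action"],
        vlm=lambda prompt: "suturing",
    )
    query = "What happens next?"
    print(route(query), "->", agent(query).answer)
```

Passing the VLM in as a plain callable keeps the sketch independent of any particular model API, so each agent differs only in its task name and CoT steps.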

Paper Structure

This paper contains 9 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The overall structure of SurgRAW, which processes surgical queries through hierarchical orchestrators and CoT-embedded expert agents, with RAG and panel discussions enhancing accuracy and domain reliability.
  • Figure 2: An example chat board for the SurgRAW framework, illustrating the workflow and the response.
  • Figure 2: Ablation study results (%). "Avg." denotes the average and "PD." denotes panel discussion.
  • Figure 3: The case study for three tasks under different prompts. Red text indicates incorrect answers, while green text highlights correct responses.
  • Figure 4: Comparison with a traditional VQA method.