Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao, Cheng Chen, Dillan Imans, Yonghao Long, Yiru Ye, Yixiao Liu, Rongyun Mai, Kai Chen, Hongliang Ren, Yutong Ban, Guangsuo Wang, Francis Wong, Chi-Fai Ng, Kee Yuan Ngiam, Russell H. Taylor, Daguang Xu, Yueming Jin, Qi Dou

Abstract

Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling basic surgical actions (BSA), the fundamental units of operation in any surgery, is important for driving progress in this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, the largest of its kind to date. Based on this dataset, we developed a foundation model for general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialty performance in experiments validated on datasets covering different procedure types and body parts. Furthermore, we demonstrate two downstream applications enabled by the BSA foundation model: surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Surgeons from multiple countries evaluated the explanatory action-planning texts generated by the language model, and their ratings demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and that an accurate BSA understanding model can facilitate complex applications and accelerate the realization of surgical superintelligence.

Paper Structure (15 sections, 18 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Illustration of our BSA-10 dataset. (a) Dataset generation pipeline. The process involves four key steps: BSA criteria establishment, reuse of existing annotations, annotation of new BSA clips, and overall dataset review. (b) Our dataset covers $6$ body parts (gallbladder, stomach, kidney, intestine, prostate gland, and uterus) and is sourced from $15$ public datasets and our SurgYT collection (surgical videos from credentialed YouTube channels). (c) Description of basic surgical actions. (d) Statistical analysis of the dataset across different basic surgical actions and procedure types.
  • Figure 2: Results of 10-fold cross-validation on the developed dataset. (a) Receiver operating characteristic (ROC) curves of the ten action classes. (b) Model results on the ten action classes at operating points based on the Youden Index, presented with 95% confidence intervals. (c) Confusion matrix across the ten action classes, obtained by aggregating the individual confusion matrices from each of the ten folds. (d) Surgery-wise performance metrics, displayed with 95% confidence intervals. (e) Representative frames of the ten basic surgical actions across eight surgery types, highlighting intra-class variability.
  • Figure 3: Action barcode visualization of BSA distributions across expertise levels. Skill analysis of three surgical procedures in RARP50 performed by (a) an experienced consultant, (b) an early consultant, and (c) a junior registrar. The action barcode delineates three BSAs, namely needle grasping, needle puncture, and suture pulling, shown in purple, blue, and orange, respectively. (d) Quantitative comparison of multiple-attempt frequency and idle-state proportion across the three surgical expertise levels.
  • Figure 4: BSA for surgical action planning with our AI agent. (a) Our AI agent takes surgical context knowledge, historical data, and current observations as inputs, and generates a response upon receiving a user prompt. (b) The AI agent's reasoning process, shown through two representative examples from the C-CVS scenario. (c) Next-action prediction results. We report local and global accuracy under both strict and relaxed conditions for the C-CVS and N-RCS scenarios.
  • Figure 5: Multinational surgeon evaluation of AI-generated surgical reasoning and action recommendations. (a) Scoring process used by surgeons across five clinical criteria: terminology correctness, description correctness, description completeness, safety considerations, and next-action reasonableness. (b) Quantitative scoring results from Hong Kong and Singapore surgeons across 98 samples, demonstrating international consensus on AI output quality. (c) Representative examples of detailed surgeon scoring, with specific feedback highlighting variations in clinical judgment and the importance of flexible evaluation metrics in surgical AI systems.
  • ...and 3 more figures
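
The Figure 2 caption mentions two evaluation steps that a short example may make concrete: choosing a per-class operating point with the Youden Index, and aggregating per-fold confusion matrices. The sketch below is illustrative only and is not the authors' code. It assumes a one-vs-rest setup for each action class (label 1 for the class of interest, 0 otherwise), and the function names `youden_threshold` and `aggregate_confusion` are hypothetical. Youden's statistic is $J = \text{sensitivity} + \text{specificity} - 1$, which equals the true positive rate minus the false positive rate.

```python
from typing import List, Sequence, Tuple


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    Assumption: one-vs-rest labels (1 = class of interest, 0 = any other class),
    and a sample is predicted positive when score >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_threshold, best_j = float("inf"), 0.0
    # Every distinct score is a candidate threshold, scanned from high to low.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y != 1)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


def aggregate_confusion(fold_matrices: Sequence[Sequence[Sequence[int]]]) -> List[List[int]]:
    """Element-wise sum of square per-fold confusion matrices of equal size."""
    n = len(fold_matrices[0])
    total = [[0] * n for _ in range(n)]
    for matrix in fold_matrices:
        for i in range(n):
            for j in range(n):
                total[i][j] += matrix[i][j]
    return total


if __name__ == "__main__":
    # Toy scores for one action class; these numbers are made up for illustration.
    toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
    toy_labels = [1, 1, 0, 1, 0]
    print(youden_threshold(toy_scores, toy_labels))

    # Two toy 2x2 fold matrices (rows: true class, columns: predicted class).
    print(aggregate_confusion([[[1, 2], [3, 4]], [[10, 20], [30, 40]]]))
```

Summing the fold matrices element-wise means every clip is counted exactly once in the aggregate, since each clip lands in the held-out split of exactly one fold under k-fold cross-validation.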