The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment

Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, Daniel Neimark, Omri Bar, Amine Yamlahi, Jakob Hennighausen, Xiaohan Wang, Rui Li, Long Liang, Yuxian Wang, Saurabh Koju, Binod Bhattarai, Tim Jaspers, Zhehua Mao, Anjana Wijekoon, Jun Ma, Yinan Xu, Zhilong Weng, Ammar M. Okran, Hatem A. Rashwan, Boyang Shen, Kaixiang Yang, Yutao Zhang, Hao Wang, 2024 CVS Challenge Consortium, Quanzheng Li, Filippo Filicori, Xiang Li, Pietro Mascagni, Daniel A. Hashimoto, Guy Rosman, Ozanan Meireles, Nicolas Padoy

TL;DR

The paper introduces the SAGES Critical View of Safety (CVS) Challenge, the first biomedical AI competition organized by a surgical society, to benchmark AI-driven surgical quality assessment using CVS in laparoscopic cholecystectomy. It presents EndoGlacier, a scalable orchestration framework for global video collection, multi-annotator workflows, and quality control, enabling 1,000 annotated videos from 54 institutions across 24 countries. The study evaluates 13 teams across three subchallenges (CVS achievement, uncertainty quantification, and domain robustness), demonstrating substantial improvements over a baseline and highlighting where robustness, accuracy, and calibration intersect. Key findings show that transformer-based and hybrid architectures, surgical pretraining, ensembling, and auxiliary objectives drive gains, while robustness under distribution shift remains a crucial open problem for deployment. The benchmark, with its rich dataset and calibration-focused evaluation, provides a practical, scalable path toward trustworthy AI for surgical safety and can guide future research and deployment efforts. $\text{mAP}$, $\text{Brier}$, and Domain Robustness scores quantify performance along the accuracy, uncertainty, and generalization axes, respectively, illuminating trade-offs essential for clinical adoption.
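As a rough illustration of the two named metrics, the sketch below computes a mean average precision and a Brier score for per-clip predictions on the three CVS criteria. It is not the challenge's evaluation code. It assumes that mAP is the unweighted mean of per-criterion average precision and that the Brier score is the mean squared difference between predicted probabilities and (possibly soft) targets. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one criterion: mean precision at each positive,
    with clips ranked by descending score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(labels, scores) -> float:
    """Unweighted mean of per-criterion AP (assumed definition of mAP)."""
    n_criteria = len(labels[0])
    aps = [
        average_precision([row[c] for row in labels], [row[c] for row in scores])
        for c in range(n_criteria)
    ]
    return sum(aps) / n_criteria


def brier_score(targets, probs) -> float:
    """Mean squared error between predicted probabilities and targets.
    Targets may be soft, e.g. the fraction of annotators rating 'Achieved'."""
    sq_errors = [
        (p - t) ** 2
        for t_row, p_row in zip(targets, probs)
        for t, p in zip(t_row, p_row)
    ]
    return sum(sq_errors) / len(sq_errors)


# Hypothetical toy data: 4 clips x 3 criteria (C1, C2, C3).
labels = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
scores = [[0.9, 0.2, 0.8], [0.3, 0.1, 0.3], [0.7, 0.6, 0.4], [0.2, 0.8, 0.1]]

print("mAP:", mean_average_precision(labels, scores))
print("Brier:", brier_score(labels, scores))
```

A higher mAP means better ranking of achieved over non-achieved criteria, while a lower Brier score means better-calibrated probabilities, so a model can improve on one axis without improving on the other.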

Abstract

Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17% relative gain in assessment performance, over 80% reduction in calibration error, and a 17% relative improvement in robustness over the state-of-the-art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.

Paper Structure

This paper contains 55 sections, 4 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: A global benchmark for surgical safety AI, the challenge brought together over 1000 annotated cholecystectomy videos, representing diverse acquisition devices (hardware, instrumentation, etc.) and workflows (case difficulty, techniques, preferences, instrumentation, etc.). The dataset was constructed through a global collaborative effort spanning three years, with the bottom-left panel illustrating countries contributing data, annotations, and methods to this international initiative.
  • Figure 2: Overview of the annotation infrastructure. The pipeline integrated three coordinated flows: (i) Annotator management, encompassing recruitment, onboarding, training, and competency validation; (ii) Orchestration and automation, ensuring blinded allocation of videos, controlled pacing of assignments, reminders, and error logging; and (iii) Video management, including privacy safeguards, eligibility screening, and reprocessing to yield a curated pool of qualified surgical videos.
  • Figure 3: Distribution of adjunct imaging techniques and surgical platform across training and test sets. Proportion of videos using intraoperative cholangiography (IOC), indocyanine green (ICG) fluorescence imaging, and robotic-assisted surgery. The distribution demonstrates clinical and workflow diversity across the dataset; the training and test splits were made randomly, so the variation between them reflects unselected real-world patterns.
  • Figure 4: Distribution of CVS assessment outcomes and annotator agreement. For each CVS criterion (C1, C2, C3), we report the number of clips rated as "Achieved" vs. "Not Achieved," and the corresponding level of annotator agreement (full agreement across all three annotators vs. partial agreement). This highlights both the subjective nature of CVS assessment and the distribution of clinical outcomes in the dataset.
  • Figure 5: Overview of the SAGES CVS Challenge. The top panel shows dataset release and timeline, the middle panel summarizes team participation, and the bottom panel highlights the core tooling stack used to manage communication, data delivery, and evaluation.
  • ...and 6 more figures