The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment
Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Po-Hsing Chiang, Noemi Zorzetti, Samuele Cannas, Daniel Neimark, Omri Bar, Amine Yamlahi, Jakob Hennighausen, Xiaohan Wang, Rui Li, Long Liang, Yuxian Wang, Saurabh Koju, Binod Bhattarai, Tim Jaspers, Zhehua Mao, Anjana Wijekoon, Jun Ma, Yinan Xu, Zhilong Weng, Ammar M. Okran, Hatem A. Rashwan, Boyang Shen, Kaixiang Yang, Yutao Zhang, Hao Wang, 2024 CVS Challenge Consortium, Quanzheng Li, Filippo Filicori, Xiang Li, Pietro Mascagni, Daniel A. Hashimoto, Guy Rosman, Ozanan Meireles, Nicolas Padoy
TL;DR
The paper introduces the SAGES Critical View of Safety (CVS) Challenge, the first biomedical AI competition organized by a surgical society, to benchmark AI-driven surgical quality assessment using CVS in laparoscopic cholecystectomy. It presents EndoGlacier, a scalable orchestration framework for global video collection, multi-annotator workflows, and quality control, which enabled the curation of 1,000 annotated videos from 54 institutions across 24 countries. The study evaluates 13 teams across three subchallenges (CVS achievement, uncertainty quantification, and domain robustness), demonstrating substantial improvements over a baseline and highlighting where robustness, accuracy, and calibration intersect. Key findings show that transformer-based and hybrid architectures, surgical pretraining, ensembling, and auxiliary objectives drive gains, while robustness under distribution shift remains a crucial open problem for deployment. The benchmark, with its rich dataset and calibration-focused evaluation, provides a practical, scalable path toward trustworthy AI for surgical safety and can guide future research and deployment efforts. $\text{mAP}$, $\text{Brier}$, and Domain Robustness scores quantify performance along the accuracy, uncertainty, and generalization axes, respectively, illuminating trade-offs essential for clinical adoption.
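As a rough illustration of the two standard metrics named above, the sketch below computes a Brier score and a mean average precision in plain Python. It is not the challenge's evaluation code: the function names, the per-criterion input layout, and the choice to average AP across criteria are assumptions made for illustration, and the exact protocol (including how the Domain Robustness score is defined) should be taken from the paper itself.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared difference between predicted probabilities and targets.

    Targets may be binary (0/1) or soft values in [0, 1]; which of the two
    the challenge uses is an assumption left open here.
    """
    if not probs or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, targets)) / len(probs)


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision: mean of precision@k over the ranks k of positives."""
    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision is undefined without positives")
    return sum(precisions) / len(precisions)


def mean_average_precision(
    scores_per_criterion: Sequence[Sequence[float]],
    labels_per_criterion: Sequence[Sequence[int]],
) -> float:
    """Unweighted mean of AP over criteria (hypothetical aggregation)."""
    aps = [
        average_precision(s, l)
        for s, l in zip(scores_per_criterion, labels_per_criterion)
    ]
    return sum(aps) / len(aps)
```

A lower Brier score indicates better-calibrated probabilities, while a higher mAP indicates better ranking of positive cases, which is why the two can move independently of each other.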
Abstract
Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17% relative gain in assessment performance, over 80% reduction in calibration error, and a 17% relative improvement in robustness over the state-of-the-art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.
