A generalizable foundation model for intraoperative understanding across surgical procedures

Kanggil Park, Yongjun Jeon, Soyoung Lim, Seonmin Park, Jongmin Shin, Jung Yong Kim, Sehyeon An, Jinsoo Rhu, Jongman Kim, Gyu-Seong Choi, Namkee Oh, Kyu-Hwan Jung

TL;DR

This paper introduces ZEN, a generalizable foundation model for intraoperative surgical video understanding, trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. The results suggest a step toward unified representations for surgical scene understanding.

Abstract

In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.
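
As a rough illustration of what feature-level distillation from several frozen teachers can look like, the sketch below averages a per-teacher distance between student and teacher features. The abstract does not specify the loss, so the cosine distance, the uniform default weights, and the names `cosine_distance`, `multi_teacher_loss`, `teacher_a` and `teacher_b` are all assumptions made for illustration, not the authors' implementation. It also assumes the student features have already been projected to each teacher's feature dimension.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def multi_teacher_loss(student_feats, teacher_feats, weights=None):
    """Weighted mean of per-teacher feature distances (hypothetical loss form).

    student_feats: dict mapping teacher name -> student feature vector,
                   already projected to that teacher's dimension.
    teacher_feats: dict mapping teacher name -> frozen teacher feature vector,
                   treated as a constant target.
    weights:       optional dict of per-teacher weights (uniform if omitted).
    """
    names = list(teacher_feats)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    weighted_sum = sum(
        weights[name] * cosine_distance(student_feats[name], teacher_feats[name])
        for name in names
    )
    return weighted_sum / total_weight


# Toy example: the student matches teacher_a exactly and is orthogonal to teacher_b.
student = {"teacher_a": [1.0, 0.0], "teacher_b": [1.0, 0.0]}
teachers = {"teacher_a": [2.0, 0.0], "teacher_b": [0.0, 3.0]}
loss = multi_teacher_loss(student, teachers)
print(loss)  # 0.5: distances 0.0 and 1.0, averaged with equal weights
```

In a real training loop the teacher outputs would be computed without gradients and only the student (and any projection heads) would be updated; that part is omitted here.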

Paper Structure (30 sections, 4 equations, 13 figures, 33 tables)

This paper contains 30 sections, 4 equations, 13 figures, 33 tables.

Figures (13)

  • Figure 1: Overview of this study. a, Large-scale pretraining dataset comprising more than 4.3 million frames from more than 4,780 minimally invasive surgery videos, spanning over 21 procedures across 10 organs. b, Comparison of pretraining strategies. Various self-supervised methods were evaluated across surgical downstream tasks, after which the best-performing strategy was scaled up and compared against existing pretrained models. c, Architecture of ZEN and the multi-teacher distillation framework. ZEN is trained via feature-level distillation from multiple frozen expert teachers, including MIS-DINOv2 (ViT-Large) and PeskaVLP. d–f, Downstream surgical evaluation tasks. d, Surgical workflow understanding: surgical phase recognition, surgical action triplet recognition, and skill assessment. e, Dense spatial understanding: semantic segmentation, instance segmentation, and monocular depth estimation. f, Vision–language understanding: closed-ended and open-ended visual question answering, cross-modal retrieval, and zero-shot phase recognition.
  • Figure 1: Comprehensive comparison of self-supervised and existing pretrained models. a, Average performance in the frozen-backbone setting across 15 supervised downstream tasks for self-supervised pretrained models trained on minimally invasive surgical videos. Task-specific representative metrics are used, including the average of video-level macro F1 score and accuracy for surgical phase recognition; mean average precision (mAP) for surgical action triplet recognition; mAP for skill assessment; Dice score for semantic segmentation; the average of detection and segmentation mAP for instance segmentation; 1 $-$ absolute relative error for depth estimation; the average of macro F1 score and balanced accuracy for closed-ended visual question answering (VQA); and the average of BLEU, ROUGE-L, and METEOR scores for open-ended VQA. b, Ranking heatmap of self-supervised pretrained models across the same 15 supervised tasks in the frozen-backbone setting, based on the corresponding representative task-level metrics. c, Average performance in the frozen-backbone setting across the 15 supervised tasks for MIS-DINOv2 (ViT-L) and existing pretrained models, computed using the representative metrics as in a. d, Ranking heatmap comparing MIS-DINOv2 (ViT-L) and existing pretrained models across the 15 supervised tasks in the frozen-backbone setting. For a and c, error bars indicate 95% confidence intervals. $P$ values were calculated using a two-sided Wilcoxon signed-rank test.
  • Figure 2: Generalization performance of surgical foundation models on comprehensive clinical benchmarks. a, Comparison of downstream tasks targeted by existing surgical foundation models. Note that although EndoFM uses MIS videos for pretraining, its target tasks primarily focus on gastrointestinal endoscopy. b, ZEN outperforms other pretrained models across 20 clinical tasks on surgical video. c, Average performance in the frozen-backbone setting across 15 supervised tasks, computed using task-specific representative metrics: video-level macro F1 score and accuracy for phase recognition; triplet (IVT) mean average precision (mAP) for action triplet recognition; mAP for skill assessment; Dice score for semantic segmentation; the average of detection and segmentation mAP for instance segmentation; 1 $-$ absolute relative error for depth estimation; the average of macro F1 score and balanced accuracy for closed-ended VQA; and the average of BLEU, ROUGE-L, and METEOR for open-ended VQA. d, Ranking heatmap across the supervised tasks in the frozen-backbone setting, based on representative task-level metrics. e, Average performance in the fine-tuned backbone setting across the supervised tasks, computed using representative metrics. f, Ranking heatmap across the supervised tasks in the fine-tuned backbone setting. Error bars in c and e indicate 95% confidence intervals. $P$ values were calculated using a two-sided Wilcoxon signed-rank test.
  • Figure 2: Performance comparison for surgical workflow understanding. a, Performance of ZEN and other pretrained models on surgical phase recognition across three datasets in the frozen-backbone setting. Metrics include video-level macro F1 score, accuracy, and phase-level Jaccard index. $C$ denotes the number of phases. b, Surgical action triplet recognition performance across two datasets in the frozen-backbone setting. Performance is evaluated using mean average precision (mAP) for instrument (I), verb (V), and target (T) components, as well as their combinations. $C$ denotes the number of triplet (IVT) classes. c, Skill assessment performance using mAP and macro F1 score in the frozen-backbone setting. $C$ denotes the number of safety criteria. Error bars represent 95% confidence intervals over five independent runs ($n=5$). $P$ values were calculated using a two-sided paired $t$-test.
  • Figure 3: Performance comparison for surgical workflow understanding. a, Performance of ZEN and other pretrained models on surgical phase recognition across three datasets in the fine-tuned backbone setting. Metrics include video-level macro F1 score, accuracy, and phase-level Jaccard index. $C$ denotes the number of phases. b, Surgical action triplet recognition performance across two datasets in the fine-tuned backbone setting. Performance is evaluated using mean average precision (mAP) for instrument (I), verb (V), and target (T) components, as well as their combinations. $C$ denotes the number of triplet (IVT) classes. c, Skill assessment performance in the fine-tuned backbone setting, evaluated using mAP and macro F1 score. $C$ denotes the number of safety criteria. d, Few-shot surgical phase recognition performance using 1–5 training videos across three datasets. Results represent five independent runs ($n=5$) for each model and shot condition. The center of each box indicates the mean, box bounds denote the standard error, and whiskers indicate the lower and upper bounds of the 95% confidence interval (CI). For a–c, error bars represent 95% CIs over five independent runs ($n=5$). $P$ values were calculated using a two-sided paired $t$-test.
  • ...and 8 more figures
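
The representative task-level metrics and the five-run confidence intervals described in the captions above can be sketched roughly as follows. The task names, dictionary keys, and function names are hypothetical, chosen only for illustration. The captions also do not say how the 95% intervals were computed; a Student's t interval over the five runs is assumed here.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_DF4 = 2.776


def representative_metric(task, m):
    """Collapse a task's raw metrics (dict `m`) into one score, per the captions.

    Task names and dictionary keys are illustrative assumptions.
    """
    if task == "phase_recognition":
        return (m["macro_f1"] + m["accuracy"]) / 2
    if task in ("triplet_recognition", "skill_assessment"):
        return m["map"]
    if task == "semantic_segmentation":
        return m["dice"]
    if task == "instance_segmentation":
        return (m["det_map"] + m["seg_map"]) / 2
    if task == "depth_estimation":
        # Lower error is better, so the captions use 1 - absolute relative error.
        return 1.0 - m["abs_rel"]
    if task == "closed_vqa":
        return (m["macro_f1"] + m["balanced_accuracy"]) / 2
    if task == "open_vqa":
        return (m["bleu"] + m["rouge_l"] + m["meteor"]) / 3
    raise ValueError(f"unknown task: {task}")


def ci95_five_runs(scores):
    """Return (mean, half_width) of an assumed t-based 95% CI over five runs."""
    if len(scores) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    mean = statistics.mean(scores)
    half_width = T_CRIT_DF4 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Made-up numbers, purely to show the call pattern.
phase_score = representative_metric(
    "phase_recognition", {"macro_f1": 0.70, "accuracy": 0.90}
)
mean, half_width = ci95_five_runs([0.78, 0.80, 0.82, 0.79, 0.81])
print(phase_score, mean, half_width)
```

The paired significance tests named in the captions (Wilcoxon signed-rank across tasks, paired $t$-test across runs) are not implemented here.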