UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei

TL;DR

This paper identifies a fundamental mismatch in surgical video understanding: existing foundation models rely on pixel-level reconstruction, which spends capacity on high-frequency noise and hinders semantic comprehension of surgical dynamics. It proposes UniSurg, a video-native foundation model built on the latent-space learning framework V-JEPA and equipped with motion-guided latent masked prediction, spatiotemporal affinity self-distillation, and variance regularization. The model is trained on UniSurg-15M, a dataset of 3,658 hours of surgical video from 50 sources. Across 17 benchmarks, UniSurg delivers state-of-the-art performance in surgical workflow recognition, fine-grained action understanding, and dense perception tasks such as polyp segmentation and depth estimation, demonstrating strong cross-procedure generalization. The work underscores the potential of motion-focused, relationally consistent latent representations for universal surgical video understanding and motivates ongoing multi-institutional data aggregation for robust deployment.

Abstract

While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
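
The abstract names feature diversity regularization as the guard against representation collapse but does not give its form here. As a rough illustration only, the sketch below assumes a VICReg-style variance hinge: each embedding dimension is penalized when its standard deviation across a batch falls below a target. The function name `variance_penalty` and the parameters `target_std` and `eps` are hypothetical and are not taken from the paper.

```python
import math


def variance_penalty(features, target_std=1.0, eps=1e-4):
    """Hinge penalty on per-dimension standard deviation across a batch.

    ASSUMPTION: a VICReg-style variance term, used here only to illustrate
    what "preventing collapse" can mean; the paper's loss may differ.

    features: list of equal-length lists, one embedding per token or clip
              (at least two rows are required for the unbiased variance).
    Returns the mean over dimensions of max(0, target_std - std_d).
    """
    n = len(features)
    dim = len(features[0])
    penalty = 0.0
    for d in range(dim):
        column = [row[d] for row in features]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)  # unbiased estimate
        std = math.sqrt(var + eps)  # eps keeps the square root well-behaved
        penalty += max(0.0, target_std - std)
    return penalty / dim


# Identical embeddings (a collapsed representation) are penalized heavily,
# while embeddings that vary along every dimension incur no penalty.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
spread = [[-2.0, 1.5], [0.0, -1.5], [2.0, 0.0]]
print(variance_penalty(collapsed), variance_penalty(spread))
```

In a texture-sparse scene many patch embeddings can drift toward the same vector; a term of this kind pushes back by rewarding spread along each feature dimension.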

Paper Structure (43 sections, 14 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Overview of UniSurg. (a) The UniSurg-15M dataset aggregates 3,658 hours of surgical video spanning 50 sources, 13+ anatomical regions, and 100+ procedures. (b) Comparison of pretraining data duration with existing surgical corpora, highlighting the significant scale gap. (c) Empirical scaling trend showing improved workflow recognition performance with larger pretraining data; marker size indicates the number of model parameters. (d) The UniSurg pretraining framework: motion-guided latent masked prediction, spatiotemporal consistency via affinity self-distillation, and spatiotemporal feature diversity regularization, together enabling video-native universal representations transferable to diverse downstream tasks. (e) Summary of downstream performance gains across recognition and dense prediction benchmarks, and across surgical specialties.
  • Figure 2: Visualization of Workflow Recognition.
  • Figure 3: Qualitative Results on Polyp Segmentation and Colonoscopic Depth Estimation. Columns show the input frame, predictions from SurgVLP, EndoFM, VideoMAE-G, GastroNet, DINOv3-H, and our UniSurg, as well as the ground truth (GT). UniSurg yields tighter polyp boundaries under domain shift and more spatially coherent depth maps across anatomical segments.
  • Figure 4: Illustration of surgical action triplet recognition on the CholecT50 dataset. These frames from a CholecT50 laparoscopic cholecystectomy video exemplify the model's capability to disentangle overlapping surgical actions.