How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai, Geng Guo, Jing Zhang, Yinlu Ding, Nassir Navab, Jiebo Luo

TL;DR

The paper investigates whether contemporary video-generation systems can serve as world models in surgery by introducing SurgVeo, a surgeon-curated benchmark, and the Surgical Plausibility Pyramid (SPP) to evaluate outputs from appearance to surgical strategy. Using Veo-3 in a zero-shot setup on clips from laparoscopic hysterectomy and endoscopic pituitary surgery, the authors show a pronounced plausibility gap: generated videos achieve high Visual Perceptual Plausibility but fail to demonstrate Instrument Operation, Environment Feedback, or Surgical Intent Plausibility, even with stage-aware prompting. This work provides the first quantitative, expert-driven demonstration that visually convincing surgery videos do not encode the deep causal and procedural knowledge required for realistic surgical reasoning, and it offers a concrete framework and data for guiding future development of domain-specific, physics- and knowledge-informed world models. The SurgVeo benchmark and SPP thus establish a roadmap for advancing surgical AI toward applications in training, planning, and intraoperative support by aligning generative capabilities with real-world clinical reasoning.

Abstract

Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.

Paper Structure

This paper contains 36 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: (a) The Surgical Plausibility Pyramid (SPP) framework, illustrating four hierarchical assessment dimensions: (i) Visual Perceptual Plausibility at the appearance level, assessing the clarity and stability of generated videos, (ii) Instrument Operation Plausibility at the action level, judging the accuracy and technical proficiency of instrument manipulation, (iii) Environment Feedback Plausibility at the consequence level, measuring the realism and credibility of scene feedback, and (iv) Surgical Intent Plausibility at the strategy level, evaluating the appropriateness and clinical reasoning of surgical actions. (b) Detailed 5-point scoring rubrics (5 = excellent to 1 = poor) for evaluating each dimension.
  • Figure 2: The overall pipeline of this study. (a) An overview of the SurgVeo benchmark preparation and evaluation workflow. The surgical video dataset is processed to create pairs of a starting surgical frame and its real video continuation. The Veo model takes the surgical frame and a prompt as input to generate the surgical video prediction. A panel of four board-certified surgeons evaluates the generated surgical videos under the Surgical Plausibility Pyramid (SPP), using the real surgical video continuation as the reference. (b) An illustration of the generation and evaluation process for a single sample in the SurgVeo benchmark. A starting surgical frame and a text prompt are fed into the Veo model to generate an 8-second surgical video prediction. Expert surgeons then score this output by comparing it to the real 8-second reference video along the four dimensions of surgical plausibility, particularly at the 1-second, 3-second, and 8-second time points.
  • Figure 3: Violin plots illustrating the performance on the laparoscopic surgery track in the SurgVeo benchmark. Results are shown for (a) the baseline prompt and (b) the stage-aware prompt. The performance is assessed across the four evaluation dimensions in the SPP, with three progressively deeper shades representing evaluations at the 1-second, 3-second, and 8-second time points. Each sample point reflects the average score provided by two laparoscopic surgery experts.
  • Figure 4: Violin plots illustrating the performance on the neurosurgery track in the SurgVeo benchmark. Results are shown for (a) the baseline prompt and (b) the stage-aware prompt. The performance is assessed across the four evaluation dimensions in the SPP, with three progressively deeper shades representing evaluations at the 1-second, 3-second, and 8-second time points. Each sample point reflects the average score provided by two neurosurgery experts.
  • Figure 5: Qualitative examples of typical failures identified in the generated videos. Each example presents a side-by-side comparison of the real surgical frame (left) and the generated surgical frame (right). These examples elaborate on failures across the Surgical Plausibility Pyramid, including: (a) visual quality distortions, (b) surgical instrument errors, (c) inappropriate surgical operations, (d) inappropriate surgical targets, (e) environment feedback errors, and (f) surgical intent errors. Red arrows indicate specific illogical, anatomically incorrect, or physically impossible artifacts.
  • ...and 4 more figures
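
The captions for Figures 1 to 4 describe a scoring scheme in which each generated video is rated on the four SPP dimensions at three time points, and each plotted sample point is the average of two experts' scores. The following is a minimal sketch of that averaging step, not the authors' code. The dimension keys, the function name `average_ratings`, the assumption of integer scores, and the placeholder ratings are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# The four SPP dimensions and three time points named in the figure captions.
# The key strings themselves are hypothetical labels.
DIMENSIONS = (
    "visual_perceptual",
    "instrument_operation",
    "environment_feedback",
    "surgical_intent",
)
TIME_POINTS_S = (1, 3, 8)


def average_ratings(ratings_by_expert):
    """Average per-expert scores into one value per (dimension, time point).

    ratings_by_expert: list of dicts mapping (dimension, time_point) to a
    score on the 5-point rubric (assumed here to be an integer from 1 to 5).
    """
    averaged = {}
    for dim in DIMENSIONS:
        for t in TIME_POINTS_S:
            scores = [ratings[(dim, t)] for ratings in ratings_by_expert]
            for score in scores:
                if not 1 <= score <= 5:
                    raise ValueError(f"score {score} is outside the 1-5 rubric")
            averaged[(dim, t)] = mean(scores)
    return averaged


# Placeholder ratings for one sample; these numbers are made up, not results.
expert_a = {(d, t): 3 for d in DIMENSIONS for t in TIME_POINTS_S}
expert_b = {(d, t): 4 for d in DIMENSIONS for t in TIME_POINTS_S}

sample_point = average_ratings([expert_a, expert_b])
print(sample_point[("surgical_intent", 8)])
```

Keying the result by (dimension, time point) keeps the twelve values per sample separate, which matches how the violin plots group scores by dimension and shade them by time point.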