Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining

Keyu Zhou, Peisen Xu, Yahao Wu, Jiming Chen, Gaofeng Li, Shunlei Li

TL;DR

The proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.

Abstract

Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.
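
The cluster purity figure quoted above can be made concrete with a short sketch. It assumes the standard definition of purity (for each cluster, count the members carrying that cluster's most frequent expert label, sum over clusters, and divide by the total number of items); the abstract does not state which variant the authors use. The function name `cluster_purity` and the example labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, expert_labels):
    """Standard cluster purity (assumed definition, not confirmed by the paper).

    cluster_ids:   cluster assignment per item (e.g., from strategy mining)
    expert_labels: expert-provided label per item
    """
    if len(cluster_ids) != len(expert_labels):
        raise ValueError("cluster_ids and expert_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Group the expert labels by the cluster each item was assigned to.
    members = {}
    for cluster, label in zip(cluster_ids, expert_labels):
        members.setdefault(cluster, []).append(label)

    # For each cluster, count the items that carry its majority label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: two clusters, two expert strategy labels.
purity = cluster_purity(
    [0, 0, 0, 1, 1],
    ["follow", "follow", "zoom", "zoom", "zoom"],
)
```

In this toy case four of the five items agree with their cluster's majority label, so the purity is 0.8. A value of 1.0 would mean every mined cluster maps to exactly one expert label.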

Paper Structure

This paper contains 55 sections, 71 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Overview of the proposed strategy-supervised autonomous laparoscopic camera control framework: The bottom block illustrates the offline pipeline, where laparoscopic video demonstrations are parsed into camera-relevant events, organized as attributed event graphs, and clustered to discover dominant camera-handling strategy primitives. These clusters generate strategy labels and corresponding direction supervision signals. The top row shows the online control pipeline: a laparoscopic frame (with optional speech input) is processed by a vision-language backbone with two prediction heads that output a dominant strategy label and discrete 6-DoF motion directions. The predicted directions are executed by an IBVS–RCM controller to maintain stable, safe, and context-aware camera positioning, with surgeon-in-the-loop safety override.
  • Figure 2: Event parsing from raw laparoscopic video: Three branches detect interaction-driven events, depth-change events, and view-quality constraint events (visibility degradation and lens contamination), whose candidate intervals are fused by temporal event segmentation to form structured events $e_k=(\text{type}, t_s, t_e, x_k)$. The bottom timeline illustrates the resulting aligned event segments over time.
  • Figure 3: Event graph representation and strategy mining: Camera-relevant events are represented as nodes in an attributed graph, with temporal and semantic relations encoded as heterogeneous edges. WSBGC groups events into coherent clusters corresponding to reusable camera-handling strategies.
  • Figure 4: Experimental setup: From left to right, the surgical equipment used in the experiments, including the synthetic abdominal cavity, surgical instruments, and anti-fog devices; an overview of the surgical workspace with the robotic manipulator; and the manipulated objects, consisting of a silicone phantom and porcine tissues used to mimic real organs.
  • Figure 5: Experimental validation of the proposed autonomous laparoscopic control method in complex surgical tasks: Upper part: Illustration of a multi-site dissection task performed on porcine gastric tissue. The images demonstrate the autonomous following behavior of the laparoscope, as well as its ability to adapt the penetration ratio in response to shifts in the surgical instrument’s workspace. Lower part: Validation during a simulated silicone suturing task. The system exhibits multi-modal interaction capability, allowing the surgeon to refine the laparoscope pose via voice commands without interrupting the manual suturing process. The timestamps shown in the bottom-right corner highlight the continuous operation and efficiency of the proposed robotic assistance system.
  • ...and 4 more figures
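
The structured event $e_k=(\text{type}, t_s, t_e, x_k)$ from the Figure 2 caption can be illustrated with a minimal sketch. The fusion rule here is an assumption made for illustration: same-type candidate intervals are merged when they overlap or are separated by at most `max_gap` seconds. The paper's actual temporal event segmentation is not described in this excerpt. The names `Event`, `fuse_candidates`, and `max_gap`, and the event-type strings, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Structured event e_k = (type, t_s, t_e, x_k)."""
    type: str                               # e.g. "interaction", "depth_change"
    t_s: float                              # start time in seconds
    t_e: float                              # end time in seconds
    x: dict = field(default_factory=dict)   # attribute vector x_k (left empty here)


def fuse_candidates(candidates, max_gap=0.5):
    """Merge same-type candidate intervals that overlap or nearly touch.

    candidates: iterable of (type, t_s, t_e) tuples from the detection branches.
    max_gap:    assumed tolerance (seconds) for bridging short gaps.
    """
    events = []
    # Sorting by (type, start) places same-type candidates next to each other.
    for etype, t_s, t_e in sorted(candidates, key=lambda c: (c[0], c[1])):
        last = events[-1] if events else None
        if last is not None and last.type == etype and t_s - last.t_e <= max_gap:
            last.t_e = max(last.t_e, t_e)   # extend the current event
        else:
            events.append(Event(etype, t_s, t_e))
    # Return the events in timeline order.
    return sorted(events, key=lambda e: e.t_s)


# Hypothetical candidate intervals from two detection branches.
fused = fuse_candidates([
    ("interaction", 0.0, 1.0),
    ("depth_change", 0.8, 2.0),
    ("interaction", 1.2, 2.5),
    ("interaction", 4.0, 5.0),
])
```

The first two interaction candidates are 0.2 s apart and merge into one event spanning 0.0 to 2.5 s. The third starts 1.5 s later and stays separate, and the depth-change event is unaffected because fusion is applied per type.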