Surgment: Segmentation-enabled Semantic Search and Creation of Visual Question and Feedback to Support Video-Based Surgery Learning

Jingying Wang, Haoran Tang, Taylor Kantor, Tandis Soltani, Vitaliy Popov, Xu Wang

TL;DR

Surgment addresses the need for active, visually focused surgical learning by combining a SegGPT+SAM segmentation pipeline with two interfaces: search-by-mask for frame retrieval and a quiz-maker for image-based questions and feedback. The pipeline achieves a Dice score of 0.92 on laparoscopic cholecystectomy (lap chole) datasets, and 11 surgeons in an evaluation study found the resulting questions and feedback to be of high educational value, highlighting improvements over traditional, text-centric question generation. The study demonstrates the importance of human expert input in AI-assisted content creation, identifies UI and generalizability challenges, and points to future directions including voice interfaces, finer segmentation, and AR-enabled in-OR teaching. Collectively, Surgment offers a practical pathway to enhance preoperative preparation and surgical training through interactive, visual learning materials grounded in authentic operative scenes.

Abstract

Videos are prominent learning materials to prepare surgical trainees before they enter the operating room (OR). In this work, we explore techniques to enrich the video-based surgery learning experience. We propose Surgment, a system that helps expert surgeons create exercises with feedback based on surgery recordings. Surgment is powered by a few-shot-learning-based pipeline (SegGPT+SAM) to segment surgery scenes, achieving an accuracy of 92%. The segmentation pipeline enables functionalities to create visual questions and feedback desired by surgeons from a formative study. Surgment enables surgeons to 1) retrieve frames of interest through sketches, and 2) design exercises that target specific anatomical components and offer visual feedback. In an evaluation study with 11 surgeons, participants applauded the search-by-sketch approach for identifying frames of interest and found the resulting image-based questions and feedback to be of high educational value.

Paper Structure (63 sections, 2 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: An example question and feedback provided by FP2 in the formative study. This is an option in a multiple-choice question they created asking "When is enough exposure for dissection". FP2 explained why this was the correct answer and linked the explanations with components of the surgery scene.
  • Figure 2: Overview of the Surgment platform. (A) The video panel displays an authentic lap chole surgery video, and the Frame Gallery (A.1) shows keyframes identified from the video. When a user clicks a keyframe, the search-by-mask canvas (A.2) expands so that users can search for images by adjusting the size, shape, and position of the polygon masks. The retrieved images are displayed to the right (A.3). (B) The Question Creation panel helps surgeons create questions and feedback. Surgeons can select the images retrieved in the previous step (A.3) to create questions. Three question types are supported: MCQ (B.1), Extract a Component (B.2) and Draw a Path (B.2).
  • Figure 3: Segmentation result of UNet, SegGPT, and SAM on a single image. When two tools (clip and clip applicator) are adjacent in the image, the UNet and SegGPT models are not able to differentiate the two parts. SAM is able to distinguish the items, but cannot predict the classes of components.
  • Figure 4: Our proposed SegGPT+SAM pipeline has two steps. 1) First, it assigns each section segmented by SAM to a unique class predicted by SegGPT. Since there is a discrepancy in the segmentation results achieved by the two models, we use a "majority voting" approach, selecting the class that the majority of the pixels in the section belong to. 2) Second, we merge the adjacent sections that have the same class.
  • Figure 5: The light patches in the similarity matrix represent segments in the video that are visually consistent (a). We select the center for each patch as the keyframes, which are the peaks in the local average similarity score diagram (b).
  • ...and 6 more figures
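
The two-step merge described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes both models' outputs are already available as same-sized 2D grids (SAM as integer section ids, SegGPT as per-pixel class labels), and the function names, the toy class labels, the first-seen tie-breaking, and the use of 4-connectivity for "adjacent" are all assumptions made for illustration.

```python
from collections import Counter, deque


def assign_classes(sam_sections, seggpt_classes):
    """Step 1: give each SAM section the SegGPT class held by most of its pixels."""
    votes = {}
    for sec_row, cls_row in zip(sam_sections, seggpt_classes):
        for sec_id, cls in zip(sec_row, cls_row):
            votes.setdefault(sec_id, Counter())[cls] += 1
    # Ties go to the class counted first (an arbitrary choice for this sketch).
    return {sec_id: c.most_common(1)[0][0] for sec_id, c in votes.items()}


def merge_adjacent(sam_sections, section_class):
    """Step 2: merge touching sections that share a class (4-connectivity assumed)."""
    h, w = len(sam_sections), len(sam_sections[0])
    class_map = [[section_class[s] for s in row] for row in sam_sections]
    region = [[-1] * w for _ in range(h)]
    region_class = []  # region id -> class label
    for y in range(h):
        for x in range(w):
            if region[y][x] != -1:
                continue
            rid = len(region_class)
            region_class.append(class_map[y][x])
            region[y][x] = rid
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill over same-class neighbours
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w and region[ny][nx] == -1
                            and class_map[ny][nx] == class_map[y][x]):
                        region[ny][nx] = rid
                        queue.append((ny, nx))
    return region, region_class


# Toy 3x4 frame: SAM separates two adjacent tools (sections 0 and 1) that
# SegGPT partly blurs together, and over-splits the background (sections 2 and 3).
sam = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 3, 3]]
seggpt = [["clip", "clip", "clip", "applier"],
          ["clip", "clip", "applier", "applier"],
          ["liver", "liver", "liver", "liver"]]

section_class = assign_classes(sam, seggpt)
regions, region_class = merge_adjacent(sam, section_class)
```

In this toy input, section 1 contains one pixel that SegGPT labels "clip" and three it labels "applier", so the vote assigns "applier" to the whole section and SAM's boundary between the two tools is kept. Sections 2 and 3 both receive "liver" and touch each other, so the second step merges them into a single region.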