SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy

Abstract

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes effectively. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds a hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and they better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs over 3,855 video segments (approximately 128 h in total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks, from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.
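The abstract's description of CholeVidQA-32K (32K open-ended QA pairs over 3,855 segments, organized into Perception/Assessment/Reasoning across 11 tasks) suggests a record layout along the following lines. This is a speculative sketch only: every field name and value below is a hypothetical placeholder, not the released schema.

```python
# Speculative sketch of how a CholeVidQA-32K record might be organized.
# All field names and values are hypothetical placeholders; the released
# schema may differ.
from dataclasses import dataclass

LEVELS = ("Perception", "Assessment", "Reasoning")  # three-level hierarchy

@dataclass
class CholeVidQARecord:
    segment_id: str   # one of the 3,855 laparoscopic video segments
    level: str        # one of LEVELS
    task: str         # one of the 11 tasks, e.g. a CVS assessment task
    question: str     # open-ended question about the segment
    answer: str       # free-text reference answer

record = CholeVidQARecord(
    segment_id="segment_0001",   # hypothetical identifier
    level="Assessment",
    task="cvs_assessment",       # hypothetical task key
    question="Is the Critical View of Safety achieved in this clip?",
    answer="...",                # reference answer elided
)
assert record.level in LEVELS
```

Whatever the actual schema, the three-level label is what the SCP training scheme and the per-task evaluation would key on.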

Paper Structure

This paper contains 39 sections, 23 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Architecture Overview of Our Proposed Model SurgTEMP. The sampled video frames first go through a feature extraction pipeline containing a visual encoder, a multi-modal projector, and spatial pooling to obtain visual tokens $\mathbf{X}_s$. Textual input is processed through the textual tokenizer to yield textual tokens $\mathbf{T}_f$. Our proposed TEMP constructor takes tokens from both modalities to construct a hierarchical visual memory bank spanning spatial and temporal granularity. An LLM is then employed as the backbone to generate answers conditioned on the visual input and the textual query.
  • Figure 2: Hierarchical overview of the CholeVidQA-32K dataset. The 11 tasks are categorized into three capability levels: Perception (basic surgical scene understanding), Assessment (surgical assessment for multiple ends), and Reasoning (complex surgical scene analysis). In the question and answer examples, text indicating task-specific characteristics is shown in red, while descriptive elements are highlighted in blue.
  • Figure 3: Illustration of our proposed TEMP module. It contains three processing steps. First, multi-level text-visual attention maps are computed. Second, the spatial memory bank is constructed by selecting frames based on frame-level attention and reweighting visual tokens with the patch-level attention map. Third, the temporal memory bank is formed by temporal pooling and reweighting with the frame-level attention map. (See the code sketch after this figure list.)
  • Figure 4: Curation pipeline of the proposed dataset CholeVidQA-32K.
  • Figure 5: Comprehensive evaluation pipeline for the CholeVidQA-32K assessment framework, incorporating three complementary evaluation methodologies for robust performance measurement.
  • ...and 9 more figures
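The Figure 3 caption describes the TEMP constructor in three steps: compute multi-level text-visual attention, build a spatial memory bank from query-relevant frames, and build a temporal memory bank by pooling and reweighting. The PyTorch-style sketch below is a minimal interpretation of that flow, not the paper's implementation: the function name `temp_constructor`, the tensor shapes, the cosine-similarity and mean-based aggregations, and the top-k frame selection are all illustrative assumptions.

```python
# Minimal sketch of the three TEMP steps from the Figure 3 caption.
# Shapes, aggregation choices, and top-k selection are assumptions.
import torch
import torch.nn.functional as F

def temp_constructor(x_s, t_f, top_k=8):
    """Build spatial and temporal memory banks from visual/textual tokens.

    x_s: visual tokens, shape (T, P, D) -- T frames, P patches, dim D
    t_f: textual tokens, shape (L, D)   -- L query tokens
    """
    x_s, t_f = F.normalize(x_s, dim=-1), F.normalize(t_f, dim=-1)

    # Step 1: multi-level text-visual attention maps.
    sim = torch.einsum('tpd,ld->tpl', x_s, t_f)        # (T, P, L)
    patch_attn = sim.softmax(dim=1).mean(dim=-1)       # (T, P) within each frame
    frame_score = sim.amax(dim=1).mean(dim=-1)         # (T,) best patch per query
    frame_attn = frame_score.softmax(dim=0)            # (T,) across frames

    # Step 2: spatial memory bank -- keep the top-k most query-relevant
    # frames and reweight their patch tokens with the patch-level map.
    k = min(top_k, x_s.size(0))
    idx = frame_attn.topk(k).indices.sort().values     # preserve temporal order
    spatial_bank = x_s[idx] * patch_attn[idx].unsqueeze(-1)     # (k, P, D)

    # Step 3: temporal memory bank -- pool each frame to one token and
    # reweight the sequence with the frame-level map.
    temporal_bank = x_s.mean(dim=1) * frame_attn.unsqueeze(-1)  # (T, D)
    return spatial_bank, temporal_bank
```

Per the Figure 1 caption, the resulting memory banks would then condition the LLM backbone together with the textual query; how the two banks are fused with the query tokens is beyond this sketch.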