Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry

Wenjun Hou, Yi Cheng, Kaishuai Xu, Yan Hu, Wenjie Li, Jiang Liu

TL;DR

SCAN is a simple yet effective memory-augmented framework that leverages Multimodal LLMs to improve surgical context comprehension via Self-Contained Inquiry and achieves state-of-the-art performance, offering improved accuracy and robustness across various surgical scenarios.

Abstract

Comprehensively understanding surgical scenes in Surgical Visual Question Answering (Surgical VQA) requires reasoning over multiple objects. Previous approaches address this task using cross-modal fusion strategies to enhance reasoning ability. However, these methods often struggle with limited scene understanding and question comprehension, and some rely on external resources (e.g., pre-extracted object features), which can introduce errors and generalize poorly across diverse surgical environments. To address these challenges, we propose SCAN, a simple yet effective memory-augmented framework that leverages Multimodal LLMs to improve surgical context comprehension via Self-Contained Inquiry. SCAN operates autonomously, generating two types of memory for context augmentation: Direct Memory (DM), which provides multiple candidates (or hints) to the final answer, and Indirect Memory (IM), which consists of self-contained question-hint pairs to capture broader scene context. DM directly assists in answering the question, while IM enhances understanding of the surgical scene beyond the immediate query. Reasoning over these object-aware memories enables the model to accurately interpret images and respond to questions. Extensive experiments on three publicly available Surgical VQA datasets demonstrate that SCAN achieves state-of-the-art performance, offering improved accuracy and robustness across various surgical scenarios.
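The two-stage flow described above (generate memory first, then answer with it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable, the `Memory` class, the prompt wording, and the `n_hints`/`n_pairs` parameters are all hypothetical stand-ins for whatever multimodal LLM interface and prompts are actually used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: any multimodal LLM wrapped as (image, prompt) -> text.
Generate = Callable[[bytes, str], str]


@dataclass
class Memory:
    # Direct Memory: candidate answers (hints) for the question being asked.
    direct: List[str] = field(default_factory=list)
    # Indirect Memory: self-contained (question, hint) pairs about the scene.
    indirect: List[Tuple[str, str]] = field(default_factory=list)


def build_memory(generate: Generate, image: bytes, question: str,
                 n_hints: int = 3, n_pairs: int = 2) -> Memory:
    """Stage 1: the model queries itself to produce both memory types."""
    memory = Memory()
    for i in range(n_hints):
        # Illustrative prompt wording, not taken from the paper.
        memory.direct.append(
            generate(image, f"Give candidate answer #{i + 1} for: {question}"))
    for i in range(n_pairs):
        asked = generate(
            image, f"Ask self-contained question #{i + 1} about this surgical scene.")
        hint = generate(image, f"Give a hint for: {asked}")
        memory.indirect.append((asked, hint))
    return memory


def build_answer_prompt(question: str, memory: Memory) -> str:
    """Format both memories as context placed before the actual question."""
    lines = ["Direct memory (candidate answers):"]
    lines += [f"- {hint}" for hint in memory.direct]
    lines.append("Indirect memory (question-hint pairs):")
    lines += [f"- Q: {q} | Hint: {h}" for q, h in memory.indirect]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def answer(generate: Generate, image: bytes, question: str) -> str:
    """Stage 2: answer the question while reasoning over the generated memory."""
    memory = build_memory(generate, image, question)
    return generate(image, build_answer_prompt(question, memory))
```

Because the same `generate` callable is used for both stages, the sketch needs no external detector or pre-extracted object features, which mirrors the "self-contained" property claimed in the abstract.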

Paper Structure

This paper contains 27 sections, 8 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Given an image and a question, S$^2$Can generates memory via self-contained inquiry ($\rightarrow$) and answers the question ($\rightarrow$).
  • Figure 2: Illustration of our proposed S$^2$Can framework, which first generates memory and then uses it for VQA. Red spans mark information that is highly related to the answers. Note that Indirect Memory can be used to enhance any relevant question about the given image.
  • Figure 3: Performance of LLaVA-Med-v1.5, BLIP-3, and our S$^2$Can on different question types (e.g., Action or Location).
  • Figure 4: Two major causes of errors produced by S$^2$Can, i.e., wrong direct memory and wrong indirect memory. Cases (a) and (b) are selected from the EndoVis-18-VQA and Cholec80-VQA datasets.