Advancing Surgical VQA with Scene Graph Knowledge

Kun Yuan, Manasi Kattel, Joel L. Lavanchy, Nassir Navab, Vinkle Srivastav, Nicolas Padoy

TL;DR

This work aims to advance visual question answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question–condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design.

Abstract

The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. Compared to existing surgical VQA datasets, our SSG-QA dataset is more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis of the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation of current surgical VQA systems is the lack of scene knowledge to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA
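
To make the dataset-construction idea in the abstract concrete, the following is a minimal sketch of how a scene graph built from spatial and action information could drive a template-based question engine. It is not the authors' implementation: the names `SceneObject`, `spatial_attribute`, `build_scene_graph`, and `generate_qa`, the quadrant rule for the location attribute, the normalized box format, and the three question templates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical detection result: a class name and a normalized box."""
    name: str
    box: tuple  # (x1, y1, x2, y2) in [0, 1], origin at the top-left corner


def spatial_attribute(box):
    """Assumed rule: name the image quadrant that contains the box center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    vertical = "top" if cy < 0.5 else "bottom"
    horizontal = "left" if cx < 0.5 else "right"
    return f"{vertical}-{horizontal}"


def build_scene_graph(objects, triplets):
    """Nodes carry spatial information; edges carry (subject, action, target)."""
    nodes = {o.name: {"box": o.box, "location": spatial_attribute(o.box)}
             for o in objects}
    edges = [{"subject": s, "action": a, "target": t} for s, a, t in triplets]
    return {"nodes": nodes, "edges": edges}


def generate_qa(graph):
    """Fill illustrative question templates from the graph's nodes and edges."""
    qa = []
    for name, attrs in graph["nodes"].items():
        qa.append((f"Where is the {name} located?", attrs["location"]))
    for e in graph["edges"]:
        qa.append((f"What is the {e['subject']} doing to the {e['target']}?",
                   e["action"]))
        qa.append((f"Which instrument is used to {e['action']} the {e['target']}?",
                   e["subject"]))
    return qa


# Toy scene: one instrument acting on one anatomy (made-up coordinates).
objects = [
    SceneObject("hook", (0.55, 0.10, 0.90, 0.45)),
    SceneObject("gallbladder", (0.10, 0.50, 0.45, 0.95)),
]
graph = build_scene_graph(objects, [("hook", "dissect", "gallbladder")])
qa_pairs = generate_qa(graph)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because every answer is read directly from the graph, each generated pair is grounded in the detected geometry and actions rather than written by hand, which is the property the abstract emphasizes.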

Paper Structure (27 sections, 1 equation, 10 figures, 7 tables)

This paper contains 27 sections, 1 equation, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The SSG-VQA dataset contains up to $50$ complex visual reasoning questions, compared to $2$ classification-based questions in Cholec-VQA [seenivasan2022surgical].
  • Figure 2: Pipeline of SSG-VQA construction. The dataset is generated by a question engine, which takes the scene graph as input and varies the parameters of question templates to produce diverse question-answer pairs.
  • Figure 3: Pipeline of the SSG-VQA-Net. It requires three types of inputs: textual, visual, and scene knowledge. The textual and scene embeddings are fed into the SIM, which generates refined scene embeddings. The visual embeddings are generated by RoIAlign. Finally, we concatenate them and feed them into the self-attention transformer to get the final answer. Here, G, H, A, and L represent class labels; $x_1$, $y_1$, $x_2$, and $y_2$ represent bounding box coordinates (G: gallbladder; H: hook; A: abdominal wall cavity; L: liver).
  • Figure 4: Scene-embedded interaction module. It is a stack of cross-attention and self-attention layers. The cross-attention modulates the scene embeddings based on the text queries, while the self-attention refines the scene embeddings.
  • Figure 5: SSG-VQA Dataset - location attribute: (a) we use triplet annotations for action-related question-answer generation; (b) we include a spatial attribute for each object, e.g., top-right, bottom-left; (c) we use tool and anatomy bounding boxes to generate question-answer pairs.
  • ...and 5 more figures
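
The interaction described in the Figure 4 caption can be illustrated with a minimal, dependency-free sketch of one cross-attention step followed by one self-attention step. This is a simplification under stated assumptions, not the SIM as published: it uses a single head with no learned projections, layer normalization, or feed-forward block, it assumes the scene embeddings act as attention queries over the text embeddings, and the residual addition and the names `cross_attention` and `sim_layer` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query vector attends over `context`.

    Assumption: `context` serves as both keys and values (no projections),
    and the attended result is added back to the query (residual).
    """
    d = len(context[0])
    refined = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)]
        refined.append([qi + ai for qi, ai in zip(q, attended)])
    return refined


def sim_layer(scene, text):
    """One illustrative layer: text-conditioned cross-attention on the scene
    embeddings, then self-attention among the refined scene embeddings."""
    modulated = cross_attention(scene, text)
    return cross_attention(modulated, modulated)


# Toy embeddings: two scene objects and three question tokens, dimension 2.
scene = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined_scene = sim_layer(scene, text)
print(refined_scene)
```

The output keeps one vector per scene object, so the refined scene embeddings can be concatenated with the visual embeddings downstream, as the Figure 3 caption describes.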