Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery

Long Bai; Guankun Wang; Mobarakol Islam; Lalithkumar Seenivasan; An Wang; Hongliang Ren

TL;DR

The proposed surgical visual question localized-answering (VQLA) approach can serve as an effective tool for supporting surgical education and patient care and for improving surgical outcomes, and it remains effective under real-world image corruption.

Abstract

Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, VQA models cannot visually indicate the regions of interest that correspond to a given question, which leaves their comprehension of the surgical scene incomplete. To tackle this, we propose surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage an adversarial sample-based contrastive learning strategy to boost the model's performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on the aforementioned datasets demonstrate the remarkable performance and robustness of our solution, which can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education and patient care and for enhancing surgical outcomes.

Paper Structure (25 sections, 15 equations, 7 figures, 10 tables)

Introduction
Related Work
Grounded VQA in Computer Vision
Grounded VQA in the Medical Domain
Methodology
Preliminaries
VisualBERT
Multi-head Attention
Adversarial Examples
Feature Extraction
C$^2$G-ViL Embedding
Co-Attention Cross-Model Interaction
Multimodal Collaborated Calibration
Global Contextual Calibration
Gated Fusion
...and 10 more sections

Figures (7)

Figure 1: Comparison of the conventional VQA and our VQLA model. By providing the answer to 'What' and 'Where', we can help learners to better infer 'Why' and achieve a better understanding of surgical scenes.
Figure 2: Robustness results of Surgical-VQLA (Bai et al., 2023) for different types of samples at each severity level. The four types of samples include the tissue being operated on, the current state of the instruments, the location of the instruments in the operating area, and the identification of the instruments. We can observe that as the severity level of corruption increases, the performance of the model significantly decreases.
Figure 3: The overall network architecture of our Surgical-VQLA++ framework. The network comprises a visual feature extractor, a custom-trained tokenizer, the C$^2$G-ViL embedding module (embedding setup, co-attention cross-model interaction, multimodal collaborated calibration, global contextual calibration, gated fusion), a pre-trained DeiT backbone, and parallel prediction heads for question answering and answer localization. $\odot$, $\otimes$, and $\oplus$ represent the element-wise dot product, cross product, and summation operation, respectively. Dim Concat denotes dimensional concatenation.
Figure 4: Overview of the adversarial contrastive training strategy. The visual and text embeddings are perturbed separately. After both the clean and the perturbed embeddings are propagated through the embedding fusion and the pre-trained DeiT backbone, the contrastive loss is applied to the resulting clean and perturbed feature embeddings.
Figure 5: Qualitative comparison of our Surgical-VQLA++ against SOTA solutions. The color of the detection bounding box corresponds to the color of the answers, in which red represents the ground truth and light blue represents our proposed solution. On the four types of question-answer pairs (tissue, action, location, instrument), our method achieves the best performance on both answering and localization tasks.
...and 2 more figures
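
The training flow described in the Figure 4 caption (perturb the visual and text embeddings separately, pass clean and perturbed inputs through the same fusion and backbone, then contrast the two sets of features) can be illustrated with a small sketch. This is not the authors' implementation. It assumes an FGSM-style sign-gradient perturbation and an InfoNCE-style contrastive loss, neither of which is confirmed by this page. The names `fgsm_perturb`, `info_nce`, and `toy_backbone` are hypothetical, the gradients are made-up inputs, and `toy_backbone` is a trivial stand-in for the embedding fusion and DeiT backbone.

```python
import math


def fgsm_perturb(embedding, grad, epsilon=0.1):
    """Assumed FGSM-style step: move each coordinate by epsilon along the gradient sign."""
    return [x + epsilon * ((g > 0) - (g < 0)) for x, g in zip(embedding, grad)]


def cosine(a, b):
    """Cosine similarity with a small constant guarding against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(clean, perturbed, temperature=0.5):
    """Assumed InfoNCE-style loss: clean[i] and perturbed[i] form the positive pair,
    and every other perturbed sample in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(clean):
        logits = [cosine(anchor, p) / temperature for p in perturbed]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def toy_backbone(visual, text):
    """Hypothetical stand-in for embedding fusion plus backbone: element-wise sum."""
    return [v + t for v, t in zip(visual, text)]


# Toy batch of two samples; the gradients below are invented for illustration.
visual = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
text = [[0.5, 0.5, 0.0], [-0.5, 0.5, 1.0]]
visual_grads = [[0.2, -0.1, 0.0], [-0.3, 0.4, 0.1]]
text_grads = [[-0.1, 0.3, 0.2], [0.1, -0.2, 0.0]]

clean_features = [toy_backbone(v, t) for v, t in zip(visual, text)]
adversarial_features = [
    toy_backbone(fgsm_perturb(v, gv), fgsm_perturb(t, gt))
    for v, gv, t, gt in zip(visual, visual_grads, text, text_grads)
]
contrastive_loss = info_nce(clean_features, adversarial_features)
```

In this sketch the loss is small when each clean feature stays closest to its own perturbed counterpart, which is the behavior a robustness-oriented contrastive objective is meant to encourage. In a real training loop the gradients would come from backpropagating the task loss to the embeddings rather than being supplied by hand.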
