Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

Tongkun Su, Jun Li, Xi Zhang, Haibo Jin, Hao Chen, Qiong Wang, Faqin Lv, Baoliang Zhao, Yin Hu

TL;DR

This paper uses Visual Question Answering (VQA) for multimodal pre-training to guide the framework to focus on targeted pathological features, and proposes a pre-training framework with a quasi-textual feature transformer, a module that transforms visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy.

Abstract

Multimodal pre-training has shown its potential in the medical domain, where it learns medical visual representations from paired medical reports. However, many pre-training tasks require extra annotations from clinicians, and most of them fail to explicitly guide the model to learn the desired features of different pathologies. In this paper, we utilize Visual Question Answering (VQA) for multimodal pre-training to guide the framework to focus on targeted pathological features. We leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. We apply our framework to four downstream tasks (report generation, classification, segmentation, and detection) across five datasets. Extensive experiments demonstrate the superiority of our framework compared to other state-of-the-art methods. Our code is available at https://github.com/MoramiSu/QFT-MICCAI2024.
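The contrastive strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style objective between quasi-textual features and textual features of paired samples; the abstract does not state the exact loss, so the function names, the temperature value, and the symmetric form are all assumptions for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(queries, keys, temperature):
    """Pairwise cosine similarities divided by the temperature."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    return [
        [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        for qi in q
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(quasi_text, text, temperature=0.07):
    """Hypothetical alignment loss: pull each quasi-textual feature
    toward the textual feature of the same sample, in both directions.
    The temperature of 0.07 is an assumed default, not a reported value."""
    logits = cosine_logits(quasi_text, text, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )
```

Under this sketch, a batch whose quasi-textual and textual features are correctly paired yields a lower loss than the same batch with the pairs shuffled, which is the sense in which such an objective would narrow the vision-language gap.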

Paper Structure (15 sections, 5 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Example of the breast VQA design. [Empty] indicates that this question will not be treated as pre-training data.
  • Figure 2: Overview of our framework. The image and text features extracted by the visual and text encoders are aligned by the quasi-textual feature transformer module with two contrastive learning tasks. The quasi-textual features are then fed to the text generator, which generates answers to the questions during pre-training. The TE, VE, and QFT domains denote the latent spaces of the global textual features $T_i$, global visual features $V_i$, and quasi-textual features $Q_i$, respectively.
  • Figure 3: Performance on visual recognition tasks. We compare our method with GLoRIA, MGCA, and MRM. We report detection results only on ResNet because our detection backbone, YOLOv3, is based on convolutional neural networks. Despite being pre-trained on a relatively small dataset, our method demonstrates balanced and nearly the best performance across various tasks, while other methods exhibit some shortcomings. Numerical details are in the supplementary materials.
  • Figure 4: Examples of visual recognition tasks. Our method achieves superior performance in classification, detection, and segmentation compared to GLoRIA.
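The [Empty] convention in the Figure 1 caption can be sketched as a simple filtering step. This is a minimal sketch, assuming that answers have already been extracted from a report into a question-to-answer mapping; the `QAPair` type, the `build_pretraining_pairs` helper, and the sample questions are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass

EMPTY = "[Empty]"  # marker used in Figure 1 for questions without an answer


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


def build_pretraining_pairs(answers_by_question):
    """Keep only the questions whose answer was found in the report.

    Questions marked [Empty] (or left blank) are skipped, so they are
    not treated as pre-training data.
    """
    pairs = []
    for question, answer in answers_by_question.items():
        cleaned = answer.strip()
        if not cleaned or cleaned == EMPTY:
            continue
        pairs.append(QAPair(question, cleaned))
    return pairs


# Made-up example; the questions and answers are illustrative only.
example = {
    "What is the shape of the lesion?": "irregular",
    "Is there calcification?": EMPTY,
}
pairs = build_pretraining_pairs(example)
```

Here only the first question survives, because the second is marked [Empty] and therefore contributes no supervision.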