MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Jiawei Chen, Dingkang Yang, Yue Jiang, Yuxuan Lei, Lihua Zhang

TL;DR

The paper tackles data scarcity and transferability in medical VQA by reframing VQA as a generative task and introducing a unified Joint Text-Multimodal (JTM) encoder. MISS combines a ViT-based image encoder, a JTM encoder, and a text decoder, and features a Transfer-and-Caption (TransCap) component that uses LLMs to generate captions for unimodal data, expanding multimodal training data. During fine-tuning, the model generates natural-language answers, enabling deployment in real-world medical settings without candidate-answer pools. Experiments on VQA-RAD and SLAKE show competitive accuracy with far less multimodal data, highlighting the practicality and effectiveness of a generative Med-VQA approach.

Abstract

Medical visual question answering (VQA) is a challenging multimodal task, where Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer classification task, which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using Large Language Models (LLMs), enabling data from traditional medical vision tasks to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
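
The image-text alignment mentioned in the abstract is commonly trained with an image-text contrastive (ITC) objective. The following is a minimal sketch of a generic symmetric contrastive loss, not the paper's exact formulation: the function names, the temperature value of 0.07, and the plain-list feature format are all assumptions made for illustration, and feature vectors are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single row of logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def itc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is a matched pair.

    Hypothetical helper: the temperature and the averaging of the two
    directions are illustrative choices, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # Cosine similarity between every image and every text, scaled.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each image should pick out its own caption.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    t2i = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Matched pairs yield a lower loss than mismatched ones, which is the signal that pulls image and text features of the same sample together during pretraining.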

Paper Structure (18 sections, 5 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Pretraining (a) and Finetuning (b) of our proposed method. We propose MISS, a pretraining and finetuning framework for Med-VQA tasks, which is composed of an image encoder, a JTM encoder, and a text decoder. ITC, ITM, and MLM learning are used for pretraining. In the finetuning stage, the joint feature interacts with tokenized answers for MLM learning.
  • Figure 2: Transfer and Caption unimodal images. We construct image descriptions based on image attributes and ChatGPT.
  • Figure 3: Examples of constructing new image-text pairs with TransCap. Discrete image attribute information is converted to image descriptions end-to-end by ChatGPT.
  • Figure 4: Comparison of data from the MedICaT dataset and image-text pair data generated through TransCap. Image-text pairs generated by the TransCap method contain less noise and are more human-readable in terms of both image and caption.
  • Figure 5: Answers of our method and the ground truth (Label).
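
The TransCap idea described in Figures 2 and 3, turning discrete image attributes into a caption with an LLM, can be sketched roughly as follows. This is an illustrative sketch only: the attribute keys, the prompt wording, and the `caption_fn` callable (a stand-in for whatever LLM client is used) are hypothetical and are not taken from the paper.

```python
def build_transcap_prompt(attributes):
    """Turn discrete image attributes into a captioning prompt for an LLM.

    `attributes` is a hypothetical dict such as
    {"modality": "X-ray", "organ": "chest"}; the real schema may differ.
    """
    if not attributes:
        raise ValueError("at least one attribute is required")
    facts = "; ".join(f"{key}: {value}" for key, value in attributes.items())
    return (
        "Write one concise, factual caption for a medical image "
        "using only these attributes. " + facts + "."
    )


def make_image_text_pair(image_path, attributes, caption_fn):
    """Pair a unimodal image with a generated caption.

    `caption_fn` is any callable mapping a prompt string to a caption
    string; in practice it would wrap an LLM call (an assumption here).
    """
    prompt = build_transcap_prompt(attributes)
    return {"image": image_path, "caption": caption_fn(prompt)}


# Stub standing in for an LLM so the sketch runs offline.
def echo_llm(prompt):
    return "Caption for -> " + prompt


pair = make_image_text_pair(
    "scan_001.png",
    {"modality": "X-ray", "organ": "chest", "finding": "normal"},
    echo_llm,
)
print(pair["caption"])
```

Keeping the LLM behind a plain callable makes it easy to swap providers or to test the data pipeline without network access.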