LaPA: Latent Prompt Assist Model For Medical Visual Question Answering

Tiancheng Gu; Kaicheng Yang; Dongnan Liu; Weidong Cai

TL;DR

The paper addresses two obstacles in Med-VQA, small domain-specific datasets and complex medical images, by introducing LaPA, a latent prompt-assisted architecture. LaPA comprises a latent prompt generation module that aligns prompts with target answers, a multi-modal fusion block that integrates prompts with uni- and multi-modal features, and a prior knowledge fusion module that applies a graph neural network to a disease-organ knowledge graph. The approach achieves state-of-the-art performance on VQA-RAD, SLAKE, and VQA-2019, improving on the prior ARL model by 1.83%, 0.63%, and 1.80%, respectively, while remaining parameter-efficient. Together, these components enable targeted extraction of clinical information and improved answer prediction, with practical potential for aiding physicians in MRI/CT interpretation and related tasks. The authors also provide ablations and qualitative analyses, and outline future work on scaling latent prompts to larger models for more complex inference.

Abstract

Medical visual question answering (Med-VQA) aims to automate the prediction of correct answers for medical images and questions, thereby helping physicians reduce repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models on additional, comprehensive datasets, followed by fine-tuning to enhance performance on downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. First, we design a latent prompt generation module that generates the latent prompt under the constraint of the target answer. Next, we propose a multi-modal fusion block with a latent prompt fusion module that uses the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. Additionally, we introduce a prior knowledge fusion module that integrates the relationship between diseases and organs with the clinically relevant information. Finally, we combine the integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, achieving improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at https://github.com/GaryGuTC/LaPA_model.
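The abstract does not specify how the latent prompt extracts information from the features, so the following is only a minimal sketch under one plausible assumption: the latent prompt is a small set of vectors that act as queries in scaled dot-product cross-attention over image and question tokens. The function names, sizes, random initialization, and the absence of learned projections are all hypothetical, and the target-answer constraint and the knowledge-graph module are not modeled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(prompts, features):
    """Each prompt token (query) takes a weighted average of the feature tokens.

    Assumption: plain scaled dot-product attention with no learned projections.
    This is a stand-in for a fusion step, not a reproduction of the paper's module.
    """
    dim = len(features[0])
    fused = []
    for query in prompts:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        fused.append([sum(w * feat[d] for w, feat in zip(weights, features))
                      for d in range(dim)])
    return fused


rng = random.Random(0)
dim, num_prompts = 8, 4  # hypothetical sizes, chosen only for illustration

# Random vectors stand in for trained latent prompts and encoder outputs.
latent_prompts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_prompts)]
image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
question_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

# Query each uni-modal feature set, then the concatenated sequence as a crude
# stand-in for multi-modal features.
from_image = cross_attend(latent_prompts, image_tokens)
from_question = cross_attend(latent_prompts, question_tokens)
from_both = cross_attend(latent_prompts, image_tokens + question_tokens)
```

Each output row is a convex combination of the attended tokens, so the prompt can only select and mix information already present in the features. A trained model would presumably add learned projections and a loss tying the prompt to the answer.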

Paper Structure (14 sections, 13 equations, 4 figures, 5 tables)

This paper contains 14 sections, 13 equations, 4 figures, 5 tables.

Introduction
Related Works
LaPA Model
Latent Prompt Generation Module
Multi-modal Fusion Block
Prior Knowledge Fusion Module
Training Details
Experiments and Results
Implementation Details
Datasets
Comparison Experiments
Ablation Study
Qualitative Analysis
Conclusion

Figures (4)

Figure 1: The overall structure of our proposed LaPA model. The input feature is denoted by a block with rounded corners, while the square-angled structure represents a module. The language and image pipelines are represented by green and blue modules, respectively. The final tokens in blue, green, and red correspond to the cross-modal image, language, and integrated information, respectively. For optimal viewing, it is recommended to zoom in for detailed examination.
Figure 2: The structure of the main modules in LaPA: (a), (b), and (c) represent the latent prompt generation module (see the Latent Prompt Generation Module section), the latent prompt fusion module (see the Multi-modal Fusion Block section), and the prior knowledge fusion module (see the Prior Knowledge Fusion Module section), respectively. For optimal visualization, it is recommended to zoom in for detailed examination.
Figure 3: Ablation on $\theta$ and $\beta$.
Figure 4: Six examples from the ablation study, in which the LaPA model is run with different modules. Instances a, b, and c are drawn from the VQA-RAD dataset, and instances d, e, and f from the SLAKE dataset. Correct responses are marked in green and erroneous predictions in red. GM., LF., and PF. abbreviate the latent prompt generation module, latent prompt fusion module, and prior knowledge fusion module, respectively.
