CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Qiangguo Jin, Xianyao Zheng, Hui Cui, Changming Sun, Yuqi Fang, Cong Cong, Ran Su, Leyi Wei, Ping Xuan, Junbo Wang

TL;DR

This work tackles Med-VQA by addressing two problems: cross-modal misalignment and the limits of fixed answer vocabularies. It proposes CMI-MTL, a three-module framework in which FVTA performs fine-grained visual-text alignment via a QQ-Former, CIFR performs cross-modal sequential fusion using Cross-Mamba blocks, and FFAE uses a T5 decoder to improve open-ended answer generation. The modules are trained under a multi-task objective $\mathcal{L}=\mathcal{L}_{\text{cls}}+\alpha\mathcal{L}_{\text{vtc}}+\beta\mathcal{L}_{\text{aux}}$. Experiments on SLAKE, VQA-RAD, and OVQA show state-of-the-art performance and strong handling of open-ended questions, supported by interpretability analyses (Grad-CAM) and efficiency assessments comparing Mamba to Transformer architectures. The findings underscore the value of cross-modal cues and auxiliary open-ended supervision for robust medical visual understanding, with potential for future human-machine collaboration in clinical settings.
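
The multi-task objective is a weighted sum of three loss terms, which can be illustrated with a minimal sketch. The summary does not define the terms or give values for the weights; the subscripts suggest a classification loss, a visual-text alignment loss, and an auxiliary answer-generation loss, and that reading is an assumption. The function name `combined_loss`, the default weights, and the numeric loss values below are hypothetical and chosen only for illustration.

```python
def combined_loss(loss_cls: float, loss_vtc: float, loss_aux: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum L = L_cls + alpha * L_vtc + beta * L_aux.

    alpha and beta are weighting hyperparameters; the defaults here are
    placeholders, not values reported by the paper.
    """
    return loss_cls + alpha * loss_vtc + beta * loss_aux


# Hypothetical per-batch loss values, for illustration only.
total = combined_loss(0.9, 0.4, 1.2, alpha=0.5, beta=0.25)
print(f"total loss: {total:.3f}")
```

Setting `alpha` and `beta` to zero reduces the objective to the classification term alone, which is one way to see what the two additional terms contribute during training.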

Abstract

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets. Treating the task as a simple classification problem prevents a model from adapting to the diversity of free-form answers and overlooks their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs. CIFR captures cross-modal sequential interactions. FFAE leverages auxiliary knowledge from open-ended questions, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct interpretability experiments to demonstrate the effectiveness of the framework. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall architecture of CMI-MTL. CMI-MTL consists of (a) fine-grained visual-text feature alignment, (b) cross-modal interleaved feature representation, and (c) free-form answer-enhanced multi-task learning.
  • Figure 2: Visual saliency maps for representative methods.