Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

Junkai Zhang, Bin Li, Shoujun Zhou, Yue Du

TL;DR

Experiments demonstrate that the HiCA-VQA framework outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions, with an 18 percent improvement in F1 score.

Abstract

Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, and well-designed Med-VQA systems can assist clinical diagnosis and improve diagnostic accuracy. Hierarchical Med-VQA extends this task by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recent studies have proposed hierarchical Med-VQA tasks and established datasets; however, several issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, causing semantic fragmentation across hierarchies; and (2) Transformer-based cross-modal self-attention fusion methods rely excessively on implicit learning, which obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes HiCA-VQA, a method with two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model toward specific image regions according to question type, while the hierarchical decoders make separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module in which images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that HiCA-VQA outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
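
The cross-attention fusion mentioned in the abstract, with image features as queries and text features as keys and values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the feature dimensions, the random projection matrices, and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_feats, text_feats, w_q, w_k, w_v):
    """Single-head cross-attention: image patches query the text tokens.

    image_feats: (num_patches, d_img) -> projected to queries
    text_feats:  (num_tokens, d_txt)  -> projected to keys and values
    Returns the fused features (num_patches, d_model) and the
    attention weights (num_patches, num_tokens).
    """
    q = image_feats @ w_q                      # (num_patches, d_model)
    k = text_feats @ w_k                       # (num_tokens, d_model)
    v = text_feats @ w_v                       # (num_tokens, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each patch attends over tokens
    return weights @ v, weights


# Hypothetical sizes; the paper's actual dimensions are not given here.
rng = np.random.default_rng(0)
num_patches, num_tokens, d_img, d_txt, d_model = 49, 12, 64, 32, 16

image_feats = rng.normal(size=(num_patches, d_img))
text_feats = rng.normal(size=(num_tokens, d_txt))
w_q = rng.normal(size=(d_img, d_model)) * 0.1
w_k = rng.normal(size=(d_txt, d_model)) * 0.1
w_v = rng.normal(size=(d_txt, d_model)) * 0.1

fused, weights = cross_attention(image_feats, text_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Because the queries come from the image, the output keeps one fused vector per image patch, and each row of `weights` shows how strongly that patch attends to each question or prompt token.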

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure S1: A schematic diagram of a traditional hierarchical medical visual question answering framework [7]. Medical images and fine-grained hierarchical medical questions are fed into an image encoder and a text encoder. The encoded features are then input into a Transformer-based fusion module for multi-modal feature integration, and finally an MLP classification layer is employed to predict the answer candidates for the corresponding medical question.
  • Figure S2: Overview of the proposed HiCA-VQA architecture. The framework comprises: (1) a hierarchical prompting module that generates prompts for questions at different levels; (2) an image encoder that encodes image features; (3) a text encoder that encodes questions and hierarchical prompts; (4) an alignment module that aligns image and prompt features; and (5) Hierarchical Answer Decoders that fuse multi-modal features for final answer prediction.
  • Figure S3: Hierarchical questions overview: The questions are organized into three levels, representing a stepwise refinement of inquiries regarding the patient's medical imaging condition. The first two levels employ binary "Yes" or "No" response candidates, while the final level contains multiple-choice candidates primarily describing pathological attributes.
  • Figure S4: A schematic diagram of a hierarchical medical visual question answering framework. Medical images and fine-grained hierarchical medical questions are fed into an image encoder and a text encoder. The encoded features are then input into a Transformer-based fusion module for multi-modal feature integration, and finally an MLP classification layer is employed to predict the answer candidates for the corresponding medical question.
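
The level-specific prediction described for Figure S3, with binary answers at the first two levels and multiple-choice answers at the third, can be sketched as one decoder head per level. This is a hedged illustration only: the mean pooling, the linear heads, the five level-3 candidates, and the names `LevelDecoder`, `LEVEL_NUM_ANSWERS`, and `predict` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Answer-space size per question level. Levels 1 and 2 are binary
# ("Yes"/"No"); the count of 5 for level 3 is a hypothetical placeholder.
LEVEL_NUM_ANSWERS = {1: 2, 2: 2, 3: 5}


class LevelDecoder:
    """A linear classification head dedicated to one question level."""

    def __init__(self, d_model, num_answers, rng):
        self.weight = rng.normal(size=(d_model, num_answers)) * 0.1
        self.bias = np.zeros(num_answers)

    def __call__(self, fused_feats):
        # Assumption: mean-pool the fused features before classifying.
        pooled = fused_feats.mean(axis=0)
        return pooled @ self.weight + self.bias


def predict(fused_feats, level, decoders):
    """Route the fused features to the decoder that matches the level."""
    logits = decoders[level](fused_feats)
    return int(np.argmax(logits)), logits


rng = np.random.default_rng(0)
d_model = 16
decoders = {
    level: LevelDecoder(d_model, num_answers, rng)
    for level, num_answers in LEVEL_NUM_ANSWERS.items()
}

# Stand-in for fused multi-modal features (49 patches, hypothetical).
fused_feats = rng.normal(size=(49, d_model))
outputs = {level: predict(fused_feats, level, decoders) for level in decoders}
for level, (answer_idx, logits) in outputs.items():
    print(level, answer_idx, logits.shape)
```

Keeping a separate head per level means each answer space has its own parameters, so the binary levels and the multiple-choice level do not compete within a single shared classifier.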