Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning

Zishan Gu, Fenglin Liu, Changchang Yin, Ping Zhang

TL;DR

This work tackles zero-shot multimodal reasoning for difference visual question answering (DVQA) in radiology by introducing MultiMedRes, a learner-agent framework that decomposes complex medical difference questions, iteratively queries domain-specific experts, and integrates their answers into a final response. By combining an LLM-based learner with specialized classifiers for Abnormality, Presence, View, Location, Type, and Level questions, the approach achieves state-of-the-art zero-shot performance on the MIMIC-Diff-VQA task, sometimes beating fully supervised methods. The method is compatible with multiple LLMs, improves vision-language model performance through dialogue augmentation, and reduces bias across age and gender groups. These results indicate a practical pathway for incorporating domain-specific expertise into zero-shot multimodal medical reasoning, offering interpretable, collaborative AI support for radiologists.

Abstract

The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, for two reasons: i) they lack rich domain-specific knowledge and medical reasoning skills; and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To this end, we propose MultiMedRes, a multimodal medical collaborative reasoning framework that incorporates a learner agent to proactively gain essential information from domain-specific expert models in order to solve medical multimodal reasoning problems. Our method includes three steps: i) Inquire: the learner agent first decomposes a given complex medical reasoning problem into multiple domain-specific sub-problems; ii) Interact: the agent then interacts with domain-specific expert models by repeating the "ask-answer" process to progressively obtain different domain-specific knowledge; iii) Integrate: the agent finally integrates all the acquired domain-specific knowledge to accurately address the medical reasoning problem. We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments demonstrate that our zero-shot prediction achieves state-of-the-art performance and even outperforms fully supervised methods. In addition, our approach can be incorporated into various LLMs and multimodal LLMs to significantly boost their performance.
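
The three steps described above can be sketched as a simple control loop. This is only an illustrative sketch, not the authors' implementation: the learner here is a scripted, rule-based stand-in for an LLM, the single "abnormality" expert is a dictionary lookup rather than a trained classifier, and every name (ScriptedLearner, answer_difference_question, TOY_LABELS, the image ids) is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Step = Tuple[str, str, str]          # (expert name, image key, sub-question)
Turn = Tuple[str, str, str]          # (image key, sub-question, expert answer)
Expert = Callable[[str, str], str]   # (image id, sub-question) -> answer


class ScriptedLearner:
    """Toy stand-in for the LLM learner agent (hypothetical, rule-based)."""

    def __init__(self, plan: List[Step]) -> None:
        self.plan = plan

    def inquire(self, question: str, history: List[Turn]) -> Optional[Step]:
        # Inquire: pick the next sub-question, or None to stop asking.
        if len(history) < len(self.plan):
            return self.plan[len(history)]
        return None

    def integrate(self, question: str, history: List[Turn]) -> str:
        # Integrate: merge the collected answers into one final answer.
        findings: Dict[str, Set[str]] = {"main": set(), "reference": set()}
        for image_key, _, answer in history:
            findings[image_key].update(a for a in answer.split(", ") if a)
        new = sorted(findings["main"] - findings["reference"])
        gone = sorted(findings["reference"] - findings["main"])
        parts = []
        if new:
            parts.append("new in main image: " + ", ".join(new))
        if gone:
            parts.append("missing from main image: " + ", ".join(gone))
        return "; ".join(parts) if parts else "no difference found"


def answer_difference_question(
    question: str,
    images: Dict[str, str],
    experts: Dict[str, Expert],
    learner: ScriptedLearner,
    max_turns: int = 8,
) -> str:
    history: List[Turn] = []
    for _ in range(max_turns):
        step = learner.inquire(question, history)
        if step is None:
            break
        expert_name, image_key, sub_question = step
        # Interact: one "ask-answer" round with the chosen expert.
        reply = experts[expert_name](images[image_key], sub_question)
        history.append((image_key, sub_question, reply))
    return learner.integrate(question, history)


# Made-up labels standing in for a real abnormality classifier's output.
TOY_LABELS = {"img_a": ["edema", "effusion"], "img_b": ["effusion"]}


def abnormality_expert(image_id: str, sub_question: str) -> str:
    return ", ".join(TOY_LABELS.get(image_id, []))


plan = [
    ("abnormality", "main", "What abnormalities are visible?"),
    ("abnormality", "reference", "What abnormalities are visible?"),
]
result = answer_difference_question(
    "What has changed compared to the reference image?",
    {"main": "img_a", "reference": "img_b"},
    {"abnormality": abnormality_expert},
    ScriptedLearner(plan),
)
print(result)  # new in main image: edema
```

In this sketch the final answer is built only from the text returned by the experts, which mirrors the idea that a text-only learner can address a multimodal question without seeing the images itself. The max_turns cap is an assumption added here to guarantee termination.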

Paper Structure

This paper contains 29 sections, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Medical image comparison vs. general image comparison.
  • Figure 2: Illustration of the expert consultation.
  • Figure 3: The proposed MultiMedRes framework. Upon receiving questions comparing two images, the learner agent employs an iterative approach, generating questions related to either the main image or the reference image, before consulting the appropriate domain expert (i.e., specialists). Upon collecting sufficient information, the learner agent is prompted to cease question generation and integrate the information to provide a zero-shot prediction.
  • Figure 4: Performance comparison for vision LLMs with and without our MultiMedRes. Our method significantly boosts the performance of vision LLMs across all metrics.
  • Figure 5: BLEU-4, METEOR, ROUGE-L, and CIDEr scores of the few-shot predictions generated by EKAID at various ratios of labeled data (i.e., question-answer pairs) used for training. The comparison is between the model trained with and without the augmented training data. The differences at each ratio are depicted using a polyline, with the scale indicated on the right y-axis.
  • ...and 4 more figures