MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao

TL;DR

MMedPO addresses factuality gaps in medical vision-language systems by introducing clinical-aware multimodal preference optimization. It constructs dispreferred data via hallucinations and lesion-focused image perturbations, then weights preference samples using clinical relevance scores from multiple Med-LLMs and lesion-detection confidence. The approach yields substantial improvements on Med-VQA and radiology report-generation tasks and remains compatible with diverse Med-LVLM backbones. By prioritizing clinically meaningful samples and lesion-focused understanding, MMedPO offers a principled path toward more reliable, interpretable medical AI systems.

Abstract

The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have not adequately accounted for clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods, averaging 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
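
To make the idea of using clinical relevance scores as weights concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard DPO-style objective in which each sample's loss term is multiplied by a relevance weight in [0, 1]. The function name `weighted_dpo_loss`, the dictionary keys, and the value of `beta` are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def weighted_dpo_loss(samples, beta=0.1):
    """Mean of per-sample DPO losses, each scaled by a relevance weight.

    Each sample is a dict holding log-probabilities of the preferred ("w")
    and dispreferred ("l") responses under the policy and under a frozen
    reference model, plus a clinical relevance weight in [0, 1].
    All key names are assumptions made for this sketch.
    """
    total = 0.0
    for s in samples:
        # Implicit reward margin between preferred and dispreferred responses.
        margin = (s["policy_logp_w"] - s["ref_logp_w"]) - (
            s["policy_logp_l"] - s["ref_logp_l"]
        )
        # A higher weight makes this sample count more toward the objective.
        total += -s["weight"] * log_sigmoid(beta * margin)
    return total / len(samples)
```

Under this sketch, a sample with weight 0 contributes nothing to the loss, while a sample with weight 1 contributes its full DPO term, so pairs judged clinically relevant dominate the gradient.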

Paper Structure

This paper contains 31 sections, 3 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: An illustration of a preference data pair. The dispreferred response contains nonfactual and clinically meaningless content.
  • Figure 2: Overview of MMedPO, a framework consisting of multimodal preference data curation, a quantified preference scoring module, and clinical-aware preference optimization. For data curation, hallucinated text responses and locally noised images are jointly constructed as preference data. The clinical relevance score is then obtained through a multi-agent collaboration system and visual tools. Finally, these scores serve as weights for the clinical-aware preference optimization.
  • Figure 3: Comparison of the effectiveness of different preference curation strategies. "stage 1": generating hallucinated medical responses; "stage 2": adding noise to localized lesion regions; "stage 1+2": merged preference data. We report the average score on each dataset.
  • Figure 4: Analysis of compatibility using LLaVA-Med++ as the backbone model. Averaged metrics across datasets are presented.
  • Figure 5: Visualization of the attention map over image tokens. The red box marks the region where attention is enhanced.
  • ...and 5 more figures
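
The "stage 2" curation step named in Figure 3 (adding noise to localized lesion regions) can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's procedure: the bounding box is taken as given (in practice it would come from a lesion detector), the noise is plain Gaussian noise, and the function name `noise_lesion_region` and its parameters are hypothetical.

```python
import random


def noise_lesion_region(image, box, sigma=25.0, seed=0):
    """Return a copy of a grayscale image with Gaussian noise inside `box`.

    image: list of rows, each a list of pixel intensities in [0, 255].
    box:   (top, left, bottom, right), with bottom and right exclusive.
           Assumed to be supplied by some external lesion detector.
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    # Copy each row so the original image is left untouched.
    noisy = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            value = noisy[r][c] + rng.gauss(0.0, sigma)
            # Clamp to the valid intensity range.
            noisy[r][c] = min(255.0, max(0.0, value))
    return noisy
```

The noised image would play the dispreferred role in a preference pair, with the untouched original as the preferred input, so that the model is pushed to attend to the lesion region.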