ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models
Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zou, Daniel Sonntag, Mathias Niepert
TL;DR
ExGra-Med tackles the data-hungry nature of autoregressive medical vision-language pre-training by introducing a triplet multi-graph alignment framework that jointly aligns images, instruction responses, and extended-context captions. The method builds three modality graphs (visual, original caption, and extended caption) and learns a shared barycenter graph to enforce structure-aware correspondences via a scalable SGA objective, trained with black-box gradient estimation through IMLE. Theoretical results establish that the proposed graph distance is a metric and supports geodesics in the space of structured graphs. Experiments show ExGra-Med matching LLaVA-Med with only 10% of the pre-training data and outperforming multiple med-MLLM baselines on VQA, visual chatbot tasks, and zero-shot image classification across 23 datasets. The work offers a data-efficient, scalable approach to medical vision-language grounding, with practical implications for deploying medical MLLMs where labeled data are costly or scarce.
Abstract
State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.
