ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models

Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zou, Daniel Sonntag, Mathias Niepert

TL;DR

ExGra-Med tackles the data-hungry nature of autoregressive medical vision-language pre-training by introducing a triplet, multi-graph alignment framework that jointly aligns images, instruction responses, and extended-context captions. The method builds three modality graphs (visual, original caption, and extended caption), and learns a shared barycenter graph to enforce structure-aware correspondences via a scalable SGA objective, trained with black-box gradient estimation through IMLE. Theoretical results establish that the proposed graph distance is a metric and supports geodesics in the space of structured graphs, while experiments show ExGra-Med matching LLaVA-Med with only 10% of data and outperforming multiple med-MLLM baselines on VQA, visual chatbot tasks, and zero-shot image classification across 23 datasets. The work offers a data-efficient, scalable approach to medical vision-language grounding, with strong practical implications for deploying medical MLLMs where labeled data are costly or scarce.
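To make the "three modality graphs plus a shared barycenter graph" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each modality is already encoded as a list of node feature vectors, that node i corresponds across modalities, that edges come from a cosine-similarity k-nearest-neighbour rule, and that the barycenter features are a plain average. All function names (`cosine`, `knn_graph`, `barycenter_features`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(features, k):
    """Return undirected edges linking each node to its k most similar nodes.

    Assumption: a k-NN rule is used here only for illustration.
    """
    edges = set()
    for i, fi in enumerate(features):
        sims = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        sims.sort(reverse=True)
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def barycenter_features(*modalities):
    """Average node features across modalities.

    Assumption: node i in one modality corresponds to node i in the others.
    """
    n_nodes = len(modalities[0])
    dim = len(modalities[0][0])
    return [
        [sum(m[i][d] for m in modalities) / len(modalities) for d in range(dim)]
        for i in range(n_nodes)
    ]


# Toy 2-D embeddings for three nodes in each modality (made-up values).
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
caption = [[0.8, 0.2], [1.0, 0.0], [0.1, 0.9]]
extended = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]

edges_visual = knn_graph(visual, k=1)
center = barycenter_features(visual, caption, extended)
edges_center = knn_graph(center, k=1)
```

In this sketch each modality graph would then be matched against the barycenter graph rather than against the other two graphs pairwise, which is the design choice the TL;DR describes as a shared barycenter.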

Abstract

State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and BioMedGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data. To address this, we introduce ExGra-Med, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization. Empirically, ExGra-Med matches LLaVA-Med's performance using just 10% of the pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data performance. It also outperforms strong baselines like BioMedGPT and RadFM on visual chatbot and zero-shot classification tasks, demonstrating its promise for efficient, high-quality vision-language integration in medical AI.
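The black-box gradient estimation mentioned above can be illustrated with a small sketch in the spirit of implicit maximum likelihood estimation, under several assumptions that are not taken from the paper: the combinatorial solver is a brute-force assignment over a tiny score matrix, the noise perturbation used by I-MLE is omitted so the estimate is deterministic, and the loss is a squared distance to a known target matching. The names `solve_matching`, `imle_style_grad`, and `lam` are hypothetical.

```python
from itertools import permutations


def solve_matching(theta):
    """Black-box solver: the 0/1 assignment matrix maximising the total score."""
    n = len(theta)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(theta[i][p[i]] for i in range(n)),
    )
    return [[1.0 if best[i] == j else 0.0 for j in range(n)] for i in range(n)]


def imle_style_grad(theta, loss_grad_fn, lam=1.0):
    """Estimate dL/dtheta without differentiating through the solver.

    1. Solve with the current scores to get z.
    2. Shift the scores against the loss gradient and solve again (z_target).
    3. Use the scaled difference of the two solutions as the gradient.
    """
    n = len(theta)
    z = solve_matching(theta)
    g = loss_grad_fn(z)
    theta_target = [[theta[i][j] - lam * g[i][j] for j in range(n)] for i in range(n)]
    z_target = solve_matching(theta_target)
    grad = [[(z[i][j] - z_target[i][j]) / lam for j in range(n)] for i in range(n)]
    return z, grad


# Toy problem: the scores currently prefer the identity matching,
# but the (assumed) ground truth is the swapped matching.
theta = [[1.0, 0.0], [0.0, 1.0]]
z_star = [[0.0, 1.0], [1.0, 0.0]]


def squared_loss_grad(z):
    """Gradient of sum((z - z_star)^2) with respect to z."""
    return [[2.0 * (z[i][j] - z_star[i][j]) for j in range(2)] for i in range(2)]


z, grad = imle_style_grad(theta, squared_loss_grad, lam=1.0)

# One plain gradient-descent step on the scores (learning rate 1.0).
theta_new = [[theta[i][j] - grad[i][j] for j in range(2)] for i in range(2)]
z_new = solve_matching(theta_new)
```

After the single update in this toy case, the solver returns the target matching, which shows how a non-differentiable matching step can still pass a useful training signal back to the scores that feed it.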

Paper Structure

This paper contains 41 sections, 2 theorems, 43 equations, 19 figures, and 18 tables.

Key Result

Proposition 1

For any two graphs ${\mathcal{G}}_1$ and ${\mathcal{G}}_2$ in the structured graph space ${\mathbb{S}}({\mathcal{F}})$, described respectively by their mixing measures $\mu_1 = \sum_{i=1}^M w_{1i} \delta_{(f_{1i},s_{1i})}$ and $\mu_2 = \sum_{j=1}^N w_{2j} \delta_{(f_{2j},s_{2j})}$, it holds that $d_{\text
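The mixing-measure notation in the proposition, $\mu = \sum_i w_i \delta_{(f_i, s_i)}$, can be read as a weighted list of atoms, one per node, each pairing a feature $f_i$ with a structural descriptor $s_i$. The sketch below shows one possible in-memory representation. It assumes the weights are non-negative and sum to one, and that features and structural descriptors are plain numeric vectors; the class `Atom` and the helper `mixing_measure` are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Atom:
    """One Dirac mass w * delta_(f, s) of a discrete mixing measure."""
    weight: float
    feature: Tuple[float, ...]
    structure: Tuple[float, ...]


def mixing_measure(
    weights: Sequence[float],
    features: Sequence[Sequence[float]],
    structures: Sequence[Sequence[float]],
    tol: float = 1e-9,
) -> List[Atom]:
    """Build mu = sum_i w_i * delta_(f_i, s_i) as a list of atoms.

    Assumption: weights form a probability vector (non-negative, sum to 1).
    """
    if not (len(weights) == len(features) == len(structures)):
        raise ValueError("weights, features and structures must have equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return [
        Atom(float(w), tuple(f), tuple(s))
        for w, f, s in zip(weights, features, structures)
    ]


# Two toy graphs with M = 2 and N = 3 nodes (made-up values).
mu_1 = mixing_measure([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [[0.0], [1.0]])
mu_2 = mixing_measure(
    [0.25, 0.25, 0.5],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0], [1.0], [2.0]],
)
```

Note that the two measures may have different numbers of atoms (M and N in the proposition), which is why a distance between graphs is defined on the measures rather than on node-by-node comparisons.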

Figures (19)

  • Figure 1: Our ExGra-Med versus LLaVA-Med across varying instruction-following (IF) pre-training data sizes, highlighting the data-hungry behavior of autoregressive modeling. Both models are fine-tuned on the same VQA-RAD training set after the pre-training stage at each IF rate. At 100% IF pre-training, ExGra-Med and LLaVA-Med are benchmarked against other state-of-the-art models, all fine-tuned on the same VQA-RAD training set (except GPT-4, which is evaluated without fine-tuning). Circle radius represents the number of model parameters.
  • Figure 2: Illustration for creating the extended context instruction-following data powered by GPT-4o.
  • Figure 3: Overview of ExGra-Med: The large language model $g_{\sigma}$ and the projector $h_{\phi}$ are trained jointly by aligning a triplet of modalities (input image, instruction-following data, and extended captions) through a structure-aware multi-graph alignment (Eq. (\ref{eq_GM})). This alignment operates over graphs $\mathcal{G}_v$, $\mathcal{G}_a$, and $\mathcal{G}_{ae}$, representing the visual, instruction, and extended textual information, respectively, via a shared barycenter graph. The entire model is optimized end-to-end using modern black-box gradient estimation techniques to enable efficient learning across modalities (Niepert et al., 2021; Minervini et al., 2023).
  • Figure 4: ExGra-Med performance on $23$ zero-shot image classification tasks within three data modalities.
  • Figure 5: Instructions provided to the system for analyzing the quality of answers based on different criteria and generating a revised response in JSON format.
  • ...and 14 more figures

Theorems & Definitions (4)

  • Definition 1: Space of all structured graphs
  • Proposition 1: Equality relation
  • Definition 2: Length and geodesic spaces
  • Lemma 1