CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging

Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar

TL;DR

CrossMed addresses compositional generalization for medical vision-language systems by defining MAT triplets and unifying tasks under a four-option VQA format. It shows that related MAT training yields strong CG gains and enables cross-task transfer, with zero-overlap tests revealing the remaining generalization gap. The benchmark is demonstrated on CheXpert, SIIM-ACR, BraTS 2020, and MosMedData using LLaVA-Vicuna-7B and Qwen2-VL-7B, highlighting the superiority of multimodal LLMs in compositional reasoning. These results provide a scalable testbed for zero-shot, cross-task, and modality-agnostic generalization, with practical relevance for clinical AI deployment.

Abstract

Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification), into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as in a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark's difficulty. We also show cross-task transfer: segmentation performance improves by 7 percent cIoU even when models are trained on classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
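To make the MAT schema concrete, the following is a minimal Python sketch of how a triplet and the zero-overlap condition could be represented. The class name `MAT`, the helper names, and the anatomy strings are illustrative assumptions, not identifiers from the CrossMed release; only the zero-overlap rule (no shared Modality, Anatomy, or Task with any training triplet) is taken from the abstract.

```python
from typing import NamedTuple, Sequence


class MAT(NamedTuple):
    """Hypothetical container for a Modality-Anatomy-Task triplet."""
    modality: str
    anatomy: str
    task: str


def shared_factors(a: MAT, b: MAT) -> int:
    """Count how many of the three factors two triplets have in common."""
    return sum(x == y for x, y in zip(a, b))


def is_zero_overlap(train: Sequence[MAT], test: MAT) -> bool:
    """True if the test triplet shares no factor with any training triplet."""
    return all(shared_factors(t, test) == 0 for t in train)


# Illustrative triplets; the anatomy labels are assumptions for this sketch.
chexpert = MAT("X-ray", "chest", "classification")
siim_acr = MAT("X-ray", "chest", "segmentation")
brats_seg = MAT("MRI", "brain", "segmentation")
```

Under these assumptions, `shared_factors(chexpert, siim_acr)` counts the shared modality and anatomy, while `is_zero_overlap([chexpert], brats_seg)` holds because the two triplets differ in all three factors.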

Paper Structure

This paper contains 13 sections, 1 theorem, 7 equations, 5 figures, 11 tables.

Key Result

Theorem 1

Let $f$ be a model trained on $n$ i.i.d. “Related” samples sharing two factors (e.g. $(M,A)$) with the test instance, let $\widehat{R}_{MA}(f)$ denote its empirical risk, and let $R(f)$ be the true risk over $(M,A,T)\sim P$. Then, with probability $1-\delta$, … In particular, if $\widehat{R}_{MA}(f)\le\varepsilon_{MA}$, then …

Figures (5)

  • Figure 1: Classification accuracy across compositional generalization conditions: Related, Unrelated, withholding individual MAT factors (w/o Modality, w/o Anatomy, w/o Task), and the All Data multi-task upper bound.
  • Figure 2: Illustration of compositional generalization under Related vs. Unrelated training splits. Top: A model trained on White Rabbit and Black Piglet images generalizes to an unseen Black Rabbit by recombining color and object attributes. Bottom: Training on MRI Spine and CT Lung enables correct interpretation of a novel CT Spine image, demonstrating generalization through shared modality–anatomy factors.
  • Figure 3: Pipeline for transforming a raw chest X-ray labeled “CARDIOMEGALY” into a CrossMed VQA sample: Raw Sample → Prompt Template Pool → Prompt Selection → Distractor Sampling → Final QA Format with MAT tags.
  • Figure 4: Four‐panel illustration of CrossMed’s MAT triplets: (Top‐Left) CheXpert chest X-ray classification, (Top‐Right) SIIM-ACR chest X-ray segmentation, (Bottom‐Left) BraTS MRI glioma-grade classification, (Bottom‐Right) BraTS MRI tumor segmentation.
  • Figure 5: Classification accuracy across X-ray, MRI, and CT tasks under varying training data fractions. Compositional generalization enables strong performance even with limited supervision.
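The conversion pipeline described in the Figure 3 caption (raw sample, prompt template pool, prompt selection, distractor sampling, final QA format with MAT tags) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the template strings, the function name `build_vqa_sample`, and the output dictionary layout are all hypothetical, and the label pool is simply a handful of CheXpert finding names.

```python
import random

# Hypothetical prompt template pool; the real templates are not given in the source.
PROMPT_TEMPLATES = [
    "What finding is shown in this {anatomy} {modality}?",
    "Which option best describes this {modality} of the {anatomy}?",
]


def build_vqa_sample(label, label_pool, mat, rng, n_options=4):
    """Turn one labeled image into a multiple-choice QA record with MAT tags."""
    modality, anatomy, task = mat
    # Prompt selection: draw one template from the pool.
    question = rng.choice(PROMPT_TEMPLATES).format(modality=modality, anatomy=anatomy)
    # Distractor sampling: pick wrong answers from the remaining labels.
    candidates = [name for name in label_pool if name != label]
    options = rng.sample(candidates, n_options - 1) + [label]
    rng.shuffle(options)
    letters = "ABCD"[:n_options]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(label)],
        "mat": {"modality": modality, "anatomy": anatomy, "task": task},
    }


labels = ["Cardiomegaly", "Edema", "Consolidation", "Atelectasis", "Pleural Effusion"]
sample = build_vqa_sample(
    "Cardiomegaly", labels, ("X-ray", "chest", "classification"), random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the template choice and option order reproducible, which is convenient when the same split has to be regenerated for different models.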

Theorems & Definitions (1)

  • Theorem 1: Two‐Factor Generalization Bound