From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation
Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum, Hyunjae Kim, Hua Xu, Qingyu Chen
TL;DR
This work tackles the scarcity of training data for medical multi-image understanding by mining license-permissive compound figures from PubMed Central. It introduces a five-stage, context-aware instruction-generation paradigm that converts compound figures and their surrounding text into training data, enabling an MLLM (M3LLM) to perform composite reasoning across images, modalities, and time points. The large-scale PMC-MI dataset and the PMC-MI-Bench benchmark underpin training and evaluation. Extensive experiments show state-of-the-art results on multi-image VQA, single-image tasks, text-only QA, and multi-choice VQA, along with strong generalization to longitudinal chest X-ray analysis on the MIMIC dataset. The authors also report robust cross-domain performance and data-efficient scaling, point to potential clinical impact, and outline future work to broaden modality coverage and refine clinical benchmarks for real-world deployment.
Abstract
Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, diagnosing disease and assessing its progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound figures in the biomedical literature as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm enables MLLMs to move beyond single-panel analysis and reach a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench, a composite-understanding benchmark manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.
