From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum, Hyunjae Kim, Hua Xu, Qingyu Chen

TL;DR

This work tackles the scarcity of training data for medical multi-image understanding by mining license-permissive compound figures from PubMed Central. It introduces a five-stage, context-aware instruction-generation paradigm to convert compound figures and surrounding text into rich training data, enabling an MLLM (M3LLM) to perform composite reasoning across images, modalities, and time. A large-scale PMC-MI dataset and the PMC-MI-Bench benchmark underpin training and evaluation, with extensive experiments showing state-of-the-art results across multi-image VQA, single-image tasks, text-only QA, and multi-choice VQA, plus strong generalization to longitudinal chest X-ray analysis (MIMIC). The approach demonstrates robust cross-domain performance, data-efficient scaling, and potential clinical impact, while outlining future work to broaden modality coverage and refine clinical benchmarks for real-world deployment.

Abstract

Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and the assessment of disease progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.

Paper Structure

This paper contains 26 sections, 30 figures, 10 tables.

Figures (30)

  • Figure 1: Illustration of a compound figure example in PMC literature. This example, derived from PMC7029651, features a compound figure composed of multiple sub-images. The example highlights longitudinal patient records with radiology and histopathology images for a case of insulinoma located in the neck of the pancreas. It integrates the accompanying image caption, which describes the visual contents, along with inline text from the manuscript that references the compound figure. To fully understand this medical compound figure, it is essential to comprehend the rich visual content and associated textual information. This includes analyzing the spatial, cross-modal, and longitudinal relationships of the sub-images, particularly concerning the first CT scan.
  • Figure 2: Overview of the study for medical compound figure understanding and clinical validation. The framework integrates PMC-derived compound figure data. Through a five-stage, context-aware instruction generation paradigm, the proposed M$^3$LLM processes medical compound figures and paired texts. The core architecture of M$^3$LLM includes a Vision Transformer (ViT), a connector module for visual-to-text alignment, and a large language model (LLM) for clinical reasoning. On this basis, the context-aware instruction tuning enables efficient and accurate multi-image comprehension. Extensive evaluation is conducted on the curated PMC-MI-Bench, public benchmarks, and MIMIC clinical cases.
  • Figure 3:
  • Figure 4: Performance comparison on open-ended text generation tasks within the PMC-MI-Bench. We compare our M$^3$LLM against state-of-the-art general-purpose and specialized medical MLLMs across three question-answering task types: Multi-image VQA, Single-image VQA, and Text-only QA. Performance is evaluated using four standard text generation metrics: (a) BLEU@4, (b) ROUGE-L, (c) BERTScore, and (d) Semantic Textual Similarity (STS). The results consistently demonstrate the superior performance of M$^3$LLM across all evaluated tasks and metrics compared to the baseline models.
  • Figure 5: Comparison of LLM-as-a-judge assessment of our M$^3$LLM against state-of-the-art MLLMs. We conduct the assessment using GPT-4o as a judge across multiple tasks on the PMC-MI-Bench, including (a) the overall performance, (b) the multi-image VQA, (c) the single-image VQA, and (d) the text-only QA.
  • ...and 25 more figures
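
Of the four text generation metrics named in the Figure 4 caption, ROUGE-L is simple enough to sketch by hand: it scores a generated answer by the longest common subsequence (LCS) of tokens it shares with the reference answer. The snippet below is a minimal illustration, not the paper's evaluation code. The lowercased whitespace tokenization, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for this example; the listing does not state which ROUGE-L settings the authors used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # Rolling-row dynamic programming: prev[j] holds the LCS length of the
    # tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F1 over LCS-based precision and recall.

    Assumption: lowercased whitespace tokenization and equal weighting of
    precision and recall; real evaluation toolkits may differ on both.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the LCS preserves token order but tolerates gaps, a model answer that omits a few words of the reference keeps full precision and loses only recall, which makes the metric more forgiving of paraphrased open-ended answers than strict n-gram overlap.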