Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda, Victoria Mar, Zongyuan Ge

TL;DR

This work tackles two critical gaps in medical vision-language pretraining: noisy web-derived data and underutilized long-form clinical knowledge. It introduces MAGEN, a three-agent system that generates knowledge-rich captions and verifies them via retrieval-augmented reasoning, to augment dermatology image-text data. It then presents O-MAKE, an ontology-guided, multi-aspect pretraining framework that decomposes long texts, aligns multiple knowledge representations at global and patch levels, and uses soft-label learning across hierarchically related diseases. Tested on eight dermatology datasets, the approach achieves state-of-the-art zero-shot disease classification and cross-modal retrieval, with substantial gains on rare diseases and long-tail settings. The resulting Derm1M-AgentAug dataset and the modular MAGEN-O-MAKE pipeline offer a scalable blueprint for advancing medical VLP beyond dermatology to other specialties.

Abstract

Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
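
The ontology-guided soft-label idea mentioned above can be illustrated with a small sketch. This is not the paper's loss: the exact O-MAKE formulation is not given in this section, so the function name `soft_label_i2t_loss`, the similarity matrix `onto_sim`, and the temperature value are all assumptions made for illustration. The sketch computes an image-to-text contrastive cross-entropy whose targets are spread over related diseases instead of being one-hot.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def soft_label_i2t_loss(img_embs, txt_embs, onto_sim, temperature=0.07):
    """Image-to-text cross-entropy against soft targets (hypothetical helper).

    img_embs, txt_embs: lists of N embedding vectors (paired by index).
    onto_sim: N x N non-negative matrix; onto_sim[i][j] is an assumed
        ontology-derived similarity between the diseases of samples i and j.
        An identity matrix recovers the usual one-hot contrastive target.
    """
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in txts]
        log_probs = log_softmax(logits)
        # Normalize the similarity row into a probability distribution.
        row_sum = sum(onto_sim[i])
        targets = [s / row_sum for s in onto_sim[i]]
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(imgs)

# Toy example: three perfectly aligned image-text pairs.
embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hard = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                    # one-hot targets
soft = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]  # samples 0 and 1 related
hard_loss = soft_label_i2t_loss(embs, embs, hard)
soft_loss = soft_label_i2t_loss(embs, embs, soft)
```

In this toy setting the soft-target loss is larger than the one-hot loss, because the soft targets ask the model to place some probability on the text of a related disease. Under such targets, embeddings of hierarchically related diseases are pulled closer together rather than being treated as pure negatives.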

Paper Structure

This paper contains 12 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: MAGEN consistently improves zero-shot performance across diverse VLP approaches. Left: disease classification accuracy averaged across datasets. Right: cross-modal retrieval. All models trained on Derm1M show substantial gains when augmented with MAGEN-generated captions.
  • Figure 2: Left: Multi-Agent Data Generation (MAGEN) system with three key components: Captioning Agent (finetuned MLLM), Summary Agent, and Verification Agent. Right: Ontology-based multi-knowledge pretraining framework with (a) multi-aspect knowledge encoding, (b) ontology-guided weighting via similarity computation, (c) multi-knowledge image alignment ($\mathcal{L}^{MKIA}_{i2t}$ and $\mathcal{L}^{MKIA}_{t2i}$), and (d) fine-grained alignment ($\mathcal{L}^{FGA}$).
  • Figure 3: Comparison between (a) traditional contrastive learning and (b) our ontology-based multi-knowledge approach.
  • Figure 4: t-SNE Visualization of Learned Visual Representations. Comparison of image embeddings from vision encoders on the top-20 classes in the SD128 dataset.
  • Figure 5: Qualitative Examples of Multi-Agent Data Generation. MAGEN transforms low-quality web-crawled captions (Origin) into knowledge-enriched descriptions in two stages: the Captioning Agent generates morphology-focused descriptions guided by a foundation model (Initial), and the Verification Agent then refines them with RAG to produce accurate diagnoses (Verified).