PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining

Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

PhenoLIP addresses the gap in medical vision–language pretraining by grounding models in a large-scale phenotype ontology. It introduces PhenoKG, a phenotype-centric multimodal knowledge graph, and a two-stage PhenoLIP framework that first learns ontology-informed phenotype embeddings and then distills this knowledge into a vision–language model. A dedicated PhenoBench benchmark enables expert-verified evaluation of phenotype recognition and cross-modal retrieval. Across zero-shot, retrieval, and linear-probing tasks, PhenoLIP consistently surpasses strong biomedical VLM baselines, demonstrating the value of ontology priors for more accurate and interpretable medical image understanding.

Abstract

Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.

Paper Structure

This paper contains 46 sections, 6 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Overview of our core contributions, PhenoKG and PhenoLIP, from left to right: (a) macro-level visualization of PhenoKG, the first large-scale phenotype-centric multimodal knowledge graph, which hierarchically organizes diverse anatomical systems; (b) unified multimodal integration that aligns phenotype images, rich visual descriptions, and structured phenotype ontology knowledge into a single graph; and (c) PhenoLIP, our phenotype knowledge-enhanced vision–language pretraining framework, which first learns a structured phenotype embedding from the ontology and then guides vision–language pretraining via distillation.
  • Figure 2: Overview of the PhenoKG construction pipeline and the PhenoLIP training process. Left: PhenoKG Construction. This illustrates the four-stage pipeline for building our multimodal knowledge graph: (1) PhenoKG Initialization, (2) Crawling image-caption pairs from the PMC-OA database using phenotype keywords, (3) Preprocessing images (via clustering-based filtering and subfigure detection) and text (via LLM-based refinement), (4) Aligning subfigures with their corresponding descriptions. Right: PhenoLIP Pretraining. This depicts our two-stage pretraining framework. First, a knowledge encoder is trained on PhenoKG's textual ontology to learn a structured phenotype embedding. Second, during vision-language pretraining, this frozen knowledge encoder serves as a teacher to distill its structured knowledge into the VLM's text encoder, complementing the primary image-text contrastive alignment.
  • Figure 3: Comparison of PhenoLIP and BiomedCLIP on phenotype retrieval from PhenoBench. For each phenotype query (left), we show the top-10 retrieved images from both models. Correct retrievals are highlighted with red boxes.
  • Figure 4: (left) Distribution of phenotype categories in PhenoKG. The dataset spans a broad range of anatomical systems, with cardiovascular, nervous, and musculoskeletal phenotypes among the most frequently represented. (right) Distribution of caption lengths (in tokens) in PhenoKG. Most captions are comprehensive, with a mean of 63 tokens, indicating that the image–text pairs provide rich and detailed clinical descriptions.
  • Figure 5: (left) Word cloud of the caption corpus in PhenoKG. Frequent terms include phenotype names, anatomical structures, and clinical descriptors, reflecting the diverse and fine-grained medical semantics captured in the dataset. (right) Frequency distribution of the top-15 most common phenotypes in PhenoKG.
  • ...and 5 more figures
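
The second pretraining stage described in the Figure 2 caption (image-text contrastive alignment complemented by distillation from a frozen knowledge encoder into the text encoder) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the symmetric InfoNCE form, the cosine-distance distillation term, the `temperature` value, the `distill_weight` factor, and all function names are assumptions introduced here, since the summary above does not specify the exact objective.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (image_to_text + text_to_image))


def distillation_loss(student_text_emb, teacher_emb):
    """Mean cosine distance between student text embeddings and frozen teacher embeddings.

    Assumption: the teacher (knowledge encoder) outputs are precomputed and
    treated as constants, i.e. no gradient would flow into them.
    """
    s = l2_normalize(student_text_emb)
    t = l2_normalize(teacher_emb)
    return float((1.0 - (s * t).sum(axis=-1)).mean())


def total_loss(image_emb, text_emb, teacher_emb, distill_weight=1.0, temperature=0.07):
    """Contrastive alignment plus a weighted distillation term (hypothetical weighting)."""
    return (contrastive_loss(image_emb, text_emb, temperature)
            + distill_weight * distillation_loss(text_emb, teacher_emb))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 32))   # stand-in for image encoder outputs
    texts = rng.normal(size=(8, 32))    # stand-in for student text encoder outputs
    teacher = rng.normal(size=(8, 32))  # stand-in for frozen knowledge encoder outputs
    print(total_loss(images, texts, teacher, distill_weight=0.5))
```

In an actual training loop these terms would be computed in an autodiff framework with the teacher's parameters frozen; NumPy is used here only to keep the sketch self-contained.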