MedFILIP: Medical Fine-grained Language-Image Pre-training

Xinjie Liang, Xiangyu Li, Fanding Li, Jie Jiang, Qing Dong, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo, Shuo Li

TL;DR

MedFILIP tackles the challenge of fine-grained medical image-language understanding by decoupling rich textual reports into precise disease triplets using a GPT-based information extractor (GPT-IE), linking categories to image-specific visual attributes through an image-specific knowledge injector (IKI), and enforcing finer image-text alignment via a semantic similarity matrix (SSM). The framework trains with a multimodal contrastive objective, incorporating continuous similarity signals and image-text matching losses, and is pre-trained on a large MIMIC-CXR-derived dataset with strong initialization from CLIP-era encoders. Across zero-shot, fine-tuned, unseen-type, segmentation, and retrieval tasks on RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, COVID, and MIMIC-CXR variants, MedFILIP achieves state-of-the-art or near-SOTA results with notable gains (e.g., up to 6.69% in zero-shot and improved Dice scores for segmentation). The work demonstrates that fine-grained, knowledge-augmented supervision can substantially improve medical VLP performance and generalization, offering practical benefits for diagnosing and interpreting chest radiographs in diverse clinical scenarios.

Abstract

Medical vision-language pretraining (VLP) that leverages naturally paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports. It excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, more informative labels and thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy increasing by up to 6.69%. The code is available at https://github.com/PerceptionComputingLab/MedFILIP.
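The idea of supervising contrastive learning with a semantic similarity matrix rather than one-hot labels can be illustrated with a small sketch. This is not the paper's loss: it assumes the matrix holds non-negative similarity scores, that each row is normalized into a target distribution, and that the loss is a plain image-to-text cross-entropy over temperature-scaled cosine similarities. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_label_contrastive_loss(img_embs, txt_embs, sim_matrix, temperature=0.07):
    """Image-to-text cross-entropy against row-normalized similarity targets.

    sim_matrix[i][j] is an assumed non-negative similarity between image i
    and text j; an identity matrix recovers ordinary one-hot contrastive targets.
    """
    total = 0.0
    for i, img in enumerate(img_embs):
        logits = [cosine(img, txt) / temperature for txt in txt_embs]
        log_probs = log_softmax(logits)
        row_sum = sum(sim_matrix[i])
        targets = [s / row_sum for s in sim_matrix[i]]
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(img_embs)

# Toy 2-d embeddings (made up): image i matches text i exactly.
img_embs = [[1.0, 0.0], [0.0, 1.0]]
txt_embs = [[1.0, 0.0], [0.0, 1.0]]
hard_targets = [[1.0, 0.0], [0.0, 1.0]]   # one-hot supervision
soft_targets = [[1.0, 0.3], [0.3, 1.0]]   # partial credit for related texts

loss_aligned = soft_label_contrastive_loss(img_embs, txt_embs, hard_targets)
loss_swapped = soft_label_contrastive_loss(img_embs, txt_embs[::-1], hard_targets)
loss_soft = soft_label_contrastive_loss(img_embs, txt_embs, soft_targets)
print(loss_aligned, loss_swapped, loss_soft)
```

With soft targets, the loss no longer pushes every non-matching text to zero probability, which is one plausible reading of "smoother" labels. A symmetric text-to-image term and any image-text matching loss are omitted here for brevity.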

Paper Structure (39 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Existing medical VLP methods are limited in adequately characterizing the relationships between images and diseases. Using reports as supervision risks (1a) confusing disease descriptions from a single report and (1b) separating same-disease samples from different cases into distinct classes, leading to inaccurate classification. Using disease entities as supervision (2a) disregards subclasses, inhibiting customized care fitting a patient's circumstances, and (2b) cannot classify classes unseen in the training dataset, lacking generalization.
  • Figure 2: MedFILIP overcomes the limitations of previous methods by effectively modeling the complex relationship between images and reports. To address concurrent diseases and repeated occurrences, MedFILIP designs a GPT-IE to compress reports into fine-grained entities. To avoid overlooking subclasses, MedFILIP retains details like location and severity in the fine-grained entities. To improve generalization to unseen categories, MedFILIP introduces an IKI to leverage image-specific knowledge from seen classes to guide inferences for unseen classes.
  • Figure 3: The proposed MedFILIP balances text complexity and information richness by distilling diagnostic reports into refined labeling formats and enhancing labels with medical image-specific knowledge. Specifically: 1) Diagnostic reports are processed by GPT-IE to obtain fine-grained entities. 2) IKI is used to augment disease entities with medical image-specific knowledge. 3) SSM is calculated based on the combination of fine-grained entities and explanations to establish clearer similarity relationships between medical images and disease descriptions. 4) Multimodal contrastive learning is conducted under the supervision of SSM. Additionally, we randomly mask the texts to address the problem of partially missing labels.
  • Figure 4: Our method maps image and text features into an embedding space, where it strategically clusters similar features while distancing dissimilar ones, and it also finely groups closely related subclasses based on semantic similarity.
  • Figure 5: A comparison of the predicted cosine similarity scores between images and texts demonstrates that MedFILIP is better at classifying subclasses and associating images with fine-grained explanations. The scores are shown in two parts. The left part compares each image against "template + category" texts and against fine-grained category texts. The right part compares each image against "template + category" texts and against fine-grained explanations of the image. In the bar charts, each bar corresponds to the text segment highlighted in the same color: blue bars give the cosine similarity between the image and the "template + category" texts, and pink bars give the cosine similarity between the image and the fine-grained categories or explanations.
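
The comparison described for Figure 5 amounts to scoring one image embedding against several candidate text embeddings by cosine similarity. The sketch below shows that scoring step only. It assumes the embeddings have already been produced by some image and text encoders (not shown); the prompt strings and the 3-d vectors are invented for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_prompts(image_emb, prompt_embs):
    """Return (prompt, score) pairs sorted from most to least similar."""
    scores = {p: cosine(image_emb, e) for p, e in prompt_embs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical precomputed embeddings; real ones would come from the encoders.
image_emb = [0.9, 0.1, 0.4]
prompt_embs = {
    "a chest x-ray showing pneumonia": [1.0, 0.0, 0.0],          # "template + category"
    "severe pneumonia in the left lower lobe": [0.9, 0.0, 0.45], # fine-grained category
    "a chest x-ray showing no finding": [0.0, 1.0, 0.0],
}

ranking = rank_prompts(image_emb, prompt_embs)
for prompt, score in ranking:
    print(f"{score:.3f}  {prompt}")
```

In this toy setup the fine-grained prompt scores highest because its vector was chosen to lie closest to the image vector; with a trained model, the ordering would depend on how well the encoders align subclasses with images.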