MedFILIP: Medical Fine-grained Language-Image Pre-training
Xinjie Liang, Xiangyu Li, Fanding Li, Jie Jiang, Qing Dong, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo, Shuo Li
TL;DR
MedFILIP tackles the challenge of fine-grained medical image-language understanding by decoupling rich textual reports into precise disease triplets using a GPT-based information extractor (GPT-IE), linking categories to image-specific visual attributes through an image-specific knowledge injector (IKI), and enforcing finer image-text alignment via a semantic similarity matrix (SSM). The framework trains with a multimodal contrastive objective, incorporating continuous similarity signals and image-text matching losses, and is pre-trained on a large MIMIC-CXR-derived dataset with strong initialization from CLIP-era encoders. Across zero-shot, fine-tuned, unseen-type, segmentation, and retrieval tasks on RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, COVID, and MIMIC-CXR variants, MedFILIP achieves state-of-the-art or near-SOTA results with notable gains (e.g., up to 6.69% in zero-shot and improved Dice scores for segmentation). The work demonstrates that fine-grained, knowledge-augmented supervision can substantially improve medical VLP performance and generalization, offering practical benefits for diagnosing and interpreting chest radiographs in diverse clinical scenarios.
Abstract
Medical vision-language pretraining (VLP) that leverages naturally paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, more informative labels and thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy increasing by up to 6.69%. The code is available at https://github.com/PerceptionComputingLab/MedFILIP.
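The idea in point 3, replacing hard one-hot contrastive targets with graded similarity labels, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the semantic similarity matrix is built, so the Jaccard overlap between triplet elements used here is an assumption chosen for simplicity, and the function names (`semantic_similarity_matrix`, `soft_contrastive_loss`) and the example triplets are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def jaccard(a, b):
    """Overlap between two collections of triplet elements (an assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def semantic_similarity_matrix(image_labels, text_labels):
    """Soft target matrix: entry [i][j] grades how well text j describes image i."""
    return [[jaccard(i, t) for t in text_labels] for i in image_labels]


def soft_contrastive_loss(logits, targets):
    """Mean cross-entropy between row-wise softmax(logits) and row-normalised soft targets."""
    loss, used = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        total = sum(target_row)
        if total == 0:
            continue  # no text matches this image at all; skip the row
        probs = softmax(logit_row)
        loss -= sum((t / total) * math.log(p)
                    for t, p in zip(target_row, probs) if t > 0)
        used += 1
    return loss / used if used else 0.0


# Hypothetical (disease, location, severity) triplets for one image and three texts.
image_triplets = [("pneumonia", "left lung", "severe")]
text_triplets = [
    ("pneumonia", "left lung", "severe"),
    ("pneumonia", "right lung", "mild"),
    ("effusion", "right lung", "mild"),
]
ssm = semantic_similarity_matrix(image_triplets, text_triplets)

# Image-text logits that agree with the soft targets give a lower loss
# than logits that rank the unrelated text highest.
aligned = soft_contrastive_loss([[5.0, 1.0, -3.0]], ssm)
misaligned = soft_contrastive_loss([[-3.0, 1.0, 5.0]], ssm)
```

With hard labels, the second text (same disease, different location and severity) would be treated as just as wrong as the third (different disease). The graded targets give it partial credit, which is the sense in which the labels are "smoother".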
