A multi-modal vision-language model for generalizable annotation-free pathology localization

Hao Yang; Hong-Yu Zhou; Jiarun Liu; Weijian Huang; Cheng Li; Zhihuan Li; Yuanxu Gao; Qiegen Liu; Yong Liang; Qi Yang; Song Wu; Tao Tan; Hairong Zheng; Kang Zhang; Shanshan Wang

TL;DR

AFLoc, an annotation-free vision-language model for pathology localization, generalizes robustly and even surpasses human benchmarks in localizing five different types of pathological images, highlighting its potential to reduce annotation requirements and its applicability in complex clinical environments.

Abstract

Existing deep learning models for defining pathology from clinical imaging data rely on expert annotations and lack generalization capabilities in open clinical environments. Here, we present a generalizable vision-language model for Annotation-Free pathology Localization (AFLoc). The core strength of AFLoc is extensive multi-level semantic structure-based contrastive learning, which comprehensively aligns multi-granularity medical concepts with abundant image features to adapt to the diverse expressions of pathologies without relying on expert image annotations. We conduct primary experiments on a dataset of 220K chest X-ray image-report pairs and perform validation across eight external datasets encompassing 34 types of chest pathologies. The results demonstrate that AFLoc outperforms state-of-the-art methods in both annotation-free localization and classification tasks. Additionally, we assess the generalizability of AFLoc on other modalities, including histopathology and retinal fundus images. We show that AFLoc exhibits robust generalization capabilities, even surpassing human benchmarks in localizing five different types of pathological images. These results highlight the potential of AFLoc in reducing annotation requirements and its applicability in complex clinical environments.
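To make the idea of contrastive image-text alignment concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired image and report embeddings. It is not AFLoc's actual objective: the abstract does not spell out the multi-level loss, so the single-level formulation, the function names, and the `temperature` value are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Generic image-text contrastive loss (assumed form, not the paper's).

    image_embs[k] and text_embs[k] are assumed to come from the same
    image-report pair; every other combination is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its own column.
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its own row.
    t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)

# Toy usage: correctly paired embeddings give a much smaller loss
# than the same embeddings with the pairing swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

A multi-level scheme of the kind the abstract describes would presumably apply a loss of this shape at several text granularities and combine the terms, but how AFLoc does so is not stated here.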

Paper Structure (18 sections, 7 equations, 9 figures, 12 tables)

Figures (9)

Figure 1: Overview of AFLoc's annotation-free pipeline for pathology localization. a. Annotation-free vision-language pre-training: AFLoc leverages contrastive vision-language pre-training with existing images and text reports to eliminate the need for additional annotation efforts. A multi-level semantic alignment scheme is proposed to facilitate the comprehensive alignment of medical concepts in text reports with image features. b. Inference: AFLoc can classify and localize all potential pathologies within the lesion list. The model encodes the input image and the automatically generated text prompts into feature embeddings. Local-level and global-level feature similarities are then computed for localization and classification, respectively. The model outputs localization results only for pathologies with positive classification predictions.
Figure 2: Comparisons of AFLoc with state-of-the-art methods across five evaluation datasets for pathology localization in chest X-ray, retinal fundus, and histopathology images. For each method–dataset pair, IoU, Dice similarity coefficient, and CNR are reported. The central dots represent the mean, and the vertical error bars indicate the 95% CI. The variable n denotes the number of evaluation images in each dataset. Detailed results are provided in Supplementary Tables \ref{extab:loc_cxr3}-\ref{extab:loc_path}.
Figure 3: Comparisons of AFLoc in the task of pathology localization across different chest pathologies. a. Results on the MS-CXR dataset, compared with existing state-of-the-art unsupervised anomaly detection models and vision-language models. The variable n denotes the number of evaluation images in each dataset. In each boxplot, the solid center line represents the median, the dashed line represents the mean, the box boundaries correspond to the first and third quartiles, the whiskers extend to the most extreme data points that are not considered outliers, and the outliers are represented by dots. b. Results on the CheXlocalize dataset, compared with various saliency methods and the human benchmark.
Figure 4: Visualization of pathology localization results across different imaging modalities and diseases. Black dashed boxes indicate the pathology areas marked by radiologists, and the intensity of red color in the heatmaps signifies the focus level of the model's prediction, with deeper red indicating higher focus.
Figure 5: Visualization of pathology localization by different models on MS-CXR. Results with simple and precise descriptions are shown to demonstrate the models' performance under different description granularities. Black dashed boxes indicate the pathology areas marked by radiologists, while deeper red indicates a higher focus level in the models' predictions.
...and 4 more figures
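The inference step described in the Figure 1 caption and two of the overlap metrics named in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the feature vectors are hand-written rather than produced by real encoders, each pathology is represented by a single prompt embedding, and the names `localize`, `binarize`, `iou_and_dice` and both threshold values are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 for a zero vector to avoid division by zero.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def localize(global_feat, patch_feats, prompt_emb, cls_threshold=0.5):
    """Classify with the global feature, localize with the patch features.

    patch_feats is a 2D grid (rows x cols) of local feature vectors.
    A heatmap is returned only when the classification is positive,
    mirroring the behaviour described in the Figure 1 caption.
    cls_threshold is an assumed value.
    """
    score = cosine(global_feat, prompt_emb)
    if score < cls_threshold:
        return score, None
    heatmap = [[cosine(p, prompt_emb) for p in row] for row in patch_feats]
    return score, heatmap

def binarize(heatmap, threshold):
    # Turn a similarity heatmap into a binary mask.
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]

def iou_and_dice(pred, target):
    # Intersection over union and Dice coefficient for two binary masks.
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_area = sum(map(sum, pred))
    target_area = sum(map(sum, target))
    union = pred_area + target_area - inter
    total = pred_area + target_area
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice

# Toy usage with a 2 x 2 grid of 2-dimensional patch features.
prompt = [1.0, 0.0]
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.1], [0.0, 1.0]]]
score, heatmap = localize([0.9, 0.1], grid, prompt)
mask = binarize(heatmap, 0.5)                         # left column is selected
iou, dice = iou_and_dice(mask, [[1, 0], [0, 0]])      # iou = 0.5, dice = 2/3
```

Gating the heatmap on the global score keeps the sketch from emitting a localization for a pathology that the classification step rejects; CNR, the third metric in Figure 2, is omitted because its exact definition is not given in this section.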