Enhancing Diagnostic Accuracy in Rare and Common Fundus Diseases with a Knowledge-Rich Vision-Language Model

Meng Wang; Tian Lin; Aidi Lin; Kai Yu; Yuanyuan Peng; Lianyu Wang; Cheng Chen; Ke Zou; Huiyu Liang; Man Chen; Xue Yao; Meiqin Zhang; Binwei Huang; Chaoxin Zheng; Peixin Zhang; Wei Chen; Yilong Luo; Yifan Chen; Honghe Xia; Tingkun Shi; Qi Zhang; Jinming Guo; Xiaolin Chen; Jingcheng Wang; Yih Chung Tham; Dianbo Liu; Wendy Wong; Sahil Thakur; Beau Fenner; Danqi Fang; Siying Liu; Qingyun Liu; Yuqiang Huang; Hongqiang Zeng; Yanda Meng; Yukun Zhou; Zehua Jiang; Minghui Qiu; Changqing Zhang; Xinjian Chen; Sophia Y. Wang; Cecilia S. Lee; Lucia Sobrin; Carol Y Cheung; Chi Pui Pang; Pearse A. Keane; Ching-Yu Cheng; Haoyu Chen; Huazhu Fu

TL;DR

RetiZero presents a knowledge-rich vision-language framework for fundus disease analysis by unifying MAE-based self-supervision with CLIP-style image-text alignment and Dirichlet-based uncertainty calibration. Trained on 341,896 image-text pairs covering over 400 diseases, it achieves strong zero-shot recognition, image-to-image retrieval, and AI-assisted clinical diagnosis, outperforming prior ophthalmic LFMs and enhancing clinician accuracy across common and rare diseases. The approach demonstrates robust internal and cross-domain generalization, few-shot efficacy, and practical clinical utility, including improved diagnostic confidence among practitioners. Limitations include data imbalance across disease categories, motivating future work with synthetic data and targeted balancing to further boost performance in rare-pathology scenarios and real-world deployment.

Abstract

Previous foundation models for fundus images were pre-trained on a limited set of disease categories and a limited knowledge base. Here we introduce a knowledge-rich vision-language model (RetiZero) that leverages knowledge from more than 400 fundus diseases. For RetiZero's pretraining, we compiled 341,896 fundus images paired with texts, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits remarkable performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, AI-assisted clinical diagnosis, few-shot fine-tuning, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top-5 accuracies of 0.843 for 15 diseases and 0.756 for 52 diseases. For image retrieval, it achieves Top-5 scores of 0.950 and 0.886 for the same sets, respectively. AI-assisted clinical diagnosis results show that RetiZero's Top-3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China, and the United States. RetiZero substantially enhances clinicians' accuracy in diagnosing fundus diseases, particularly rare ones. These findings underscore the value of integrating RetiZero into clinical settings, where various fundus diseases are encountered.
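To make the Top-K zero-shot metric concrete, the sketch below shows one common way such a score is computed for a CLIP-style model: each image embedding is compared with one text embedding per disease, and a prediction counts as correct if the true label is among the K most similar classes. This is a minimal illustration, not RetiZero's evaluation code; the function names (`cosine`, `top_k_labels`, `top_k_accuracy`) and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_labels(image_emb, text_embs, k):
    """Return the indices of the k classes whose text embedding is most similar."""
    scores = [cosine(image_emb, t) for t in text_embs]
    ranked = sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(image_embs, labels, text_embs, k):
    """Fraction of images whose true label appears among the top-k classes."""
    hits = sum(
        1
        for emb, y in zip(image_embs, labels)
        if y in top_k_labels(emb, text_embs, k)
    )
    return hits / len(labels)


# Toy data (assumed, for illustration only): three classes, three images.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]
labels = [0, 1, 2]

top1 = top_k_accuracy(image_embs, labels, text_embs, k=1)
top2 = top_k_accuracy(image_embs, labels, text_embs, k=2)
```

In this toy case the third image ranks its true class second, so it is a miss at Top-1 and a hit at Top-2. This is why Top-3 and Top-5 scores are never lower than Top-1 scores.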

Paper Structure (25 sections, 22 equations, 5 figures)

Introduction
Results
Zero-shot fundus disease recognition
Fundus disease identification by image-to-image retrieval
AI-assisted clinical diagnosis
Internal domain fundus disease identification
Few-shot fine-tuning
Cross-domain fundus disease identification
Discussion
Methods
Dataset
Data for pretraining:
Data for internal domain fundus disease identification:
Data for few-shot fine-tuning:
Data for cross-domain fundus disease identification:
...and 10 more sections

Figures (5)

Figure 1: Overview of the framework. a, Datasets for RetiZero pretraining: The RetiZero model was pre-trained using data from three primary sources: public datasets, ophthalmic literature, and online resources. We assembled a team of 12 ophthalmologists for manual data collection and cleaning. This involved downloading images and corresponding labels from public datasets, extracting images and corresponding disease-related keywords from ophthalmic literature, and downloading retinal-disease-relevant image-text pairs from online resources. b, RetiZero combines the strengths of self-supervised learning based on the MAE architecture and contrastive learning from the CLIP architecture. Moreover, we introduce an uncertainty-based vision-language feature calibration method into the contrastive vision-language pretraining framework to further calibrate vision-language features in the high-dimensional embedding space. c, Task I: Zero-shot fundus disease recognition. d, Task II: Fundus disease identification by image-to-image retrieval. e, Task III: AI-assisted clinical diagnosis. f, Task IV: Internal domain retinal disease identification. "Internal domain" means that we fine-tuned and tested the model using data with similar feature distributions. g, Task V: Few-shot fine-tuning. We evaluate RetiZero's performance in identifying fundus diseases with very limited training data. h, Task VI: Cross-domain fundus disease identification. "Cross-domain" means that we fine-tuned and tested the model using data with different feature distributions.
Figure 2: Overall Top-1, Top-3, and Top-5 scores for zero-shot fundus disease recognition and fundus disease identification by image-to-image retrieval. a, The zero-shot performance on the EYE-15 dataset, which contains 30,089 fundus images including 14 common fundus diseases and a normal condition. b, The zero-shot performance on the EYE-52 dataset, which contains 7,007 fundus images including 51 categories of fundus diseases and a normal condition. c, Zero-shot fundus disease identification samples. d, Image-to-image retrieval performance on the EYE-15 dataset. e, Image-to-image retrieval performance on the EYE-52 dataset. f, Image-to-image retrieval samples. All P values were calculated with the two-sided t-test and are listed in the figure.
Figure 3: AI-assisted clinical diagnosis results. a, Online fundus image reading system without RetiZero assistance. b, Online fundus image reading system with RetiZero assistance. c, Ophthalmologist diagnostic results, and Top-1, Top-3, and Top-5 performance for zero-shot and image-to-image retrieval. d, Details for clinical evaluation.
Figure 4: Receiver operating characteristic (ROC) curves. a, ROC curves for internal domain retinal disease identification. b, ROC curves for few-shot learning. The P values were calculated with the two-sided t-test.
Figure 5: Cross-domain performance (AUC) of different foundation models for fundus disease screening. Column a, Internal evaluation: Different foundation models were adapted to each dataset by fine-tuning and internally evaluated on hold-out testing data. Columns b-c, Performance on external validation sets: The three foundation models were tested on the other two external validation datasets. The disease categories and dataset strategy information are listed in Supplementary Tables 10-12. The error bars show the 95% CI and the bar centre represents the mean value of the AUC. P values were calculated with the two-sided t-test and are listed in the figure.
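
Figures 4 and 5 report performance as ROC curves and AUC. As a reading aid, the sketch below computes a binary AUC through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is an illustrative, quadratic-time version under assumed toy scores, not the evaluation procedure used in the paper; the function name `roc_auc` and the example values are hypothetical.

```python
def roc_auc(scores, labels):
    """Binary AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (assumed): one diseased eye is scored below a healthy one.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. For a multi-class screening task, a value like this would typically be computed per disease and then averaged, although the averaging scheme used for Figure 5 is not stated in this summary.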
