A Foundation Language-Image Model of the Retina (FLAIR): Encoding Expert Knowledge in Text Supervision

Julio Silva-Rodríguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, Ismail Ben Ayed

TL;DR

FLAIR addresses the domain gap in medical vision-language understanding by training a universal retina-focused model that encodes expert domain knowledge as text prompts. It learns a joint embedding space for images and text using a contrastive objective on 38 public retinal datasets (288,307 images, 101 categories), augmented with expert knowledge (EK) prompts that capture fine-grained features and hierarchies. In zero-shot and few-shot settings, FLAIR with EK prompts outperforms generalist models and task-specific baselines, and, with lightweight adapters, approaches or exceeds dataset-specific fine-tuning in many scenarios. The results demonstrate the potential of incorporating domain knowledge into vision-language pre-training to achieve robust generalization across domain shifts and unseen diseases, with practical implications for scalable retinal disease screening and transfer to related ophthalmic tasks.

Abstract

Foundation vision-language models are currently transforming computer vision, and are on the rise in medical imaging, fueled by their very promising generalization capabilities. However, the initial attempts to transfer this new paradigm to medical imaging have shown less impressive performance than that observed in other domains, due to the significant domain shift and the complex, expert domain knowledge inherent to medical-imaging tasks. Motivated by the need for domain-expert foundation models, we present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. To this end, we compiled 38 open-access, mostly categorical fundus imaging datasets from various sources, with up to 101 different target conditions and 288,307 images. We integrate expert domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference, enhancing the less-informative categorical supervision of the data. This textual expert knowledge, which we compiled from the relevant clinical literature and community standards, describes the fine-grained features of the pathologies as well as the hierarchies and dependencies between them. We report comprehensive evaluations, which illustrate the benefit of integrating expert knowledge and the strong generalization capabilities of FLAIR under difficult scenarios with domain shifts or unseen categories. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, more so in the few-shot regimes. Interestingly, FLAIR outperforms larger-scale generalist image-language models and retina domain-specific self-supervised networks by a wide margin, which emphasizes the potential of embedding experts' domain knowledge and the limitations of generalist models in medical imaging.
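
The TL;DR describes a contrastive objective over a joint image-text embedding space. As a rough illustration of that idea, the following is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss. It is not the authors' implementation: the function names, the temperature value of 0.07, and the assumption that row i of each batch forms the only matching pair are all illustrative choices, and the exact loss used by FLAIR is not stated in this section. In particular, with categorical labels several samples in a batch may share a category, which this sketch ignores.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each batch is assumed to be a matched pair.

    The temperature default is an assumption, not a value taken from the paper.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should peak on its diagonal entry
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same criterion on the transposed matrix
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings: matched pairs should give a lower loss than mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing such a loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what makes prompt-based zero-shot prediction possible afterwards.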

Paper Structure (46 sections, 5 equations, 14 figures, 9 tables)

This paper contains 46 sections, 5 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: CLIP limitations on medical domains. The figure depicts the cosine similarities of the text embeddings for common retinal diseases and lesions observed on fundus images. While CLIP mostly focuses on general medical relations (e.g., “diabetic”, or “neovascularization”-“venous”), the proposed domain-specific model (i.e., FLAIR) is able to capture the hierarchical dependencies between concepts (e.g., the fundus images of “mildDR” contain “only a few microaneurysms”, and “neovascularization” is the differential sign for “prolDR” diagnosis).
  • Figure 2: Expert knowledge descriptors. The analysis of fundus images by ophthalmologists is driven by hierarchical features. According to the American Academy of Ophthalmology [Wilkinson2003], mildDR is characterized by “only few microaneurysms present”, modDR includes “retinal haemorrhages in few quadrants”, “many haemorrhages” or “cotton wool spots”, and sevDR and prolDR are distinguished by “venous beading”/“intraretinal microvascular abnormalities” and “neovascularization”, respectively. DME is also usually characterized by “hard exudates involving the center of the macula”. Furthermore, according to [sevHR], hypertensive retinopathy is generally described as “flame-shaped hemorrhages in the superficial layers of the retina and cotton-wool patches”. Going deeper into the hierarchies between concepts, exudates are “small white or yellowish deposits”, and microaneurysms are “small red dots”.
  • Figure 3: Framework overview. We have developed a knowledge-based universal model of the retina from an assembly of 38 public datasets, which contains 288,307 color fundus images and 101 different categories (see top-left). The foundation model consists of vision and language encoders, which are trained in a contrastive fashion on paired images and textual descriptors. To mitigate the scarcity of text-based supervision in publicly available retinal fundus imaging datasets, we propose to augment the categorical image labels by using well-established domain knowledge (see top-right). The resulting pre-trained model enables the prediction of new categories in a zero-shot fashion, using well-designed descriptors based on domain knowledge and local features of the novel diseases; see bottom-left. In addition, the model can adapt to downstream tasks and domains by tuning a lightweight Adapter on top of the vision and language encoders, using only a few labeled samples (the support set); see bottom-right.
  • Figure 4: Transferability. Results of transferring the feature representations of the pre-trained models to downstream domains and tasks in the low-data (left column) and large-data (right column) regimes. The results were obtained by adjusting a linear-probe classifier. The metric presented is the average accuracy, averaged across 5 cross-validation folds. ZS: zero-shot (i.e., prompt-based).
  • Figure 5: Vision-language few-shot Adapters. The results of different Adapters in the few-shot setting. The metric presented is the average accuracy, averaged across 5 cross-validation folds. ZS: zero-shot (i.e., prompt-based classification with domain-knowledge prompts).
  • ...and 9 more figures
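
The captions of Figures 2 and 3 describe mapping each categorical label to expert-knowledge descriptors and then predicting categories in a zero-shot fashion from those descriptors. The sketch below illustrates that flow under strong assumptions: the descriptor strings are quoted from the Figure 2 caption, but the dictionary layout, the function names, and especially `toy_text_encoder` (a hashed bag of words standing in for a real text encoder) are hypothetical. Averaging the descriptor embeddings into one prototype per category is a common prompt-ensembling choice, not something this section confirms for FLAIR.

```python
import math

# Descriptors quoted from the Figure 2 caption; the dictionary layout is illustrative.
EK_DESCRIPTORS = {
    "mildDR": ["only few microaneurysms present"],
    "modDR": ["retinal haemorrhages in few quadrants", "many haemorrhages",
              "cotton wool spots"],
    "sevDR": ["venous beading", "intraretinal microvascular abnormalities"],
    "prolDR": ["neovascularization"],
}

def toy_text_encoder(text, dim=16):
    """Hypothetical stand-in for a trained text encoder: a hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_prototypes(descriptors, encode):
    """Average the normalized embeddings of all descriptors of each category."""
    protos = {}
    for label, prompts in descriptors.items():
        embs = [normalize(encode(p)) for p in prompts]
        mean = [sum(col) / len(embs) for col in zip(*embs)]
        protos[label] = normalize(mean)
    return protos

def zero_shot_predict(image_emb, protos):
    """Return the category whose prototype is most cosine-similar to the image."""
    img = normalize(image_emb)
    scores = {label: sum(a * b for a, b in zip(img, p))
              for label, p in protos.items()}
    return max(scores, key=scores.get), scores

protos = class_prototypes(EK_DESCRIPTORS, toy_text_encoder)
# Pretend an image encoder placed a fundus image next to the text "neovascularization".
fake_image_emb = toy_text_encoder("neovascularization")
label, scores = zero_shot_predict(fake_image_emb, protos)
print(label)  # prolDR
```

With a real model, `toy_text_encoder` would be replaced by the trained language encoder and `fake_image_emb` by the output of the vision encoder; adding a new disease would then only require adding its descriptors to the dictionary.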