Metadata-enhanced contrastive learning from retinal optical coherence tomography images

Robbie Holland, Oliver Leingang, Hrvoje Bogunović, Sophie Riedl, Lars Fritsche, Toby Prevost, Hendrik P. N. Scholl, Ursula Schmidt-Erfurth, Sobha Sivaprasad, Andrew J. Lotery, Daniel Rueckert, Martin J. Menten

TL;DR

This work tackles the challenge of applying contrastive self-supervised learning to retinal OCT data by introducing a metadata-enhanced framework that uses longitudinal patient information (identity, eye laterality, and scan timing) to define informative positive and negative pairs. By treating images acquired within a temporal window $\delta_T$ as positive pairs, removing pairs whose relationship is unknown, and retaining negatives between different patients, the authors adapt SimCLR and BYOL to OCT data and evaluate them on seven AMD-related downstream tasks in two large cohorts. The metadata-enhanced approach outperforms standard contrastive pretraining and a retinal foundation model in five of the six image-level tasks. The approach yields strong data-efficiency, with 20x–100x fewer labeled samples sometimes sufficient to match or exceed baseline performance, highlighting the practical potential for label-efficient retinal disease screening and monitoring. The results suggest a generalizable strategy for integrating readily available metadata into self-supervised learning in medical imaging and motivate extensions to other modalities and diseases.

Abstract

Deep learning has potential to automate screening, monitoring and grading of disease in medical images. Pretraining with contrastive learning enables models to extract robust and generalisable features from natural image datasets, facilitating label-efficient downstream image analysis. However, the direct application of conventional contrastive methods to medical datasets introduces two domain-specific issues. Firstly, several image transformations which have been shown to be crucial for effective contrastive learning do not translate from the natural image to the medical image domain. Secondly, the assumption made by conventional methods, that any two images are dissimilar, is systematically misleading in medical datasets depicting the same anatomy and disease. This is exacerbated in longitudinal image datasets that repeatedly image the same patient cohort to monitor their disease progression over time. In this paper we tackle these issues by extending conventional contrastive frameworks with a novel metadata-enhanced strategy. Our approach employs widely available patient metadata to approximate the true set of inter-image contrastive relationships. To this end we employ records for patient identity, eye position (i.e. left or right) and time series information. In experiments using two large longitudinal datasets containing 170,427 retinal OCT images of 7,912 patients with age-related macular degeneration (AMD), we evaluate the utility of using metadata to incorporate the temporal dynamics of disease progression into pretraining. Our metadata-enhanced approach outperforms both standard contrastive methods and a retinal image foundation model in five out of six image-level downstream tasks related to AMD. Due to its modularity, our method can be quickly and cost-effectively tested to establish the potential benefits of including available metadata in contrastive pretraining.

Paper Structure (25 sections, 1 equation, 9 figures, 3 tables, 1 algorithm)

Figures (9)

  • Figure 1: In this paper we pretrain models with existing contrastive frameworks (BYOL and SimCLR) and our metadata-enhanced contrastive versions on large datasets of unlabelled OCT images. Our method addresses existing weaknesses of standard frameworks by leveraging metadata widely available in the clinical workflow. To this end we employ information of patient identity, eye position (i.e. left or right) and time series information to indicate the set of true inter-image contrastive relationships. We benchmark pretraining strategies by quantifying improvements on seven downstream tasks related to the clinical assessment of AMD.
  • Figure 2: Our method enhances standard contrastive pretraining with widely available medical metadata to correct many of the misleading negative pairs that arise systematically in standard frameworks. Moreover, by introducing inter-image positive pairs we combine artificial image transformations with natural ones that already exist between images acquired closely in time (controlled by a $\delta_T$ parameter). Our method also removes contrastive pairs with unknown relationships and retains negative pairs containing images originating from different patients.
  • Figure 3: The distribution of longitude lengths (left) in the Southampton and Moorfields datasets and the frequency distribution of time intervals $\delta_T$ in years between all pairs of longitudinal scans from the same eye (right), with the width of each bin covering a duration of one month.
  • Figure 4: Results of linear evaluation on downstream tasks using seven logarithmically spaced amounts of labelled finetuning samples (with 95% confidence intervals). For clarity we omit the ImageNet and RadImageNet models, which were consistently outperformed by the RETFound baseline (model colour key in bottom right). In all tasks except segmentation, and visual acuity on Moorfields data, metadata-enhanced contrastive pretraining extending BYOL and using $\delta_T\leq1.0$ outperforms standard contrastive learning and RETFound, especially in scenarios with fewer labelled data.
  • Figure A.5: Performance of finetuning models with fully unfrozen weights on downstream tasks on both datasets. Depicted is the performance (with 95% CIs) against varying sizes of the labelled subsets used for finetuning.
  • ...and 4 more figures
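
The pair-selection rule described in the Figure 2 caption can be sketched as a small relationship matrix over a batch of images. This is a minimal illustration, not the authors' implementation: the function name `pair_relationships`, the label constants, and the exact rule (positive when two images share patient and eye and lie within $\delta_T$ years; negative when the patients differ; ignored otherwise) are assumptions inferred from the captions above.

```python
import numpy as np

# Hypothetical labels for the relationship between two images in a batch.
POSITIVE, NEGATIVE, IGNORED = 1, -1, 0


def pair_relationships(patient_id, eye, scan_time_years, delta_t=1.0):
    """Build an N x N matrix of assumed contrastive relationships from metadata.

    Assumed rule (a reading of the Figure 2 caption, not confirmed code):
      - same patient, same eye, |time difference| <= delta_t -> POSITIVE
      - different patients                                   -> NEGATIVE
      - any other same-patient pair (unknown relationship)   -> IGNORED
    """
    patient_id = np.asarray(patient_id)
    eye = np.asarray(eye)
    t = np.asarray(scan_time_years, dtype=float)

    # Broadcasting compares every image against every other image.
    same_patient = patient_id[:, None] == patient_id[None, :]
    same_eye = same_patient & (eye[:, None] == eye[None, :])
    close_in_time = np.abs(t[:, None] - t[None, :]) <= delta_t

    rel = np.full(same_patient.shape, IGNORED, dtype=int)
    rel[~same_patient] = NEGATIVE
    # The diagonal (an image paired with itself) is also marked POSITIVE,
    # matching the usual augmented-view pair of standard frameworks.
    rel[same_eye & close_in_time] = POSITIVE
    return rel


# Toy batch: three scans of patient "a" (two left eye, one right) and one of "b".
rel = pair_relationships(
    patient_id=["a", "a", "a", "b"],
    eye=["L", "L", "R", "L"],
    scan_time_years=[0.0, 0.5, 0.2, 0.1],
    delta_t=1.0,
)
```

In a contrastive loss, such a matrix would typically be used as a mask: POSITIVE entries contribute attracting terms, NEGATIVE entries contribute repelling terms, and IGNORED entries are dropped from the loss. How the paper combines this with SimCLR and BYOL is not shown in this excerpt.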