Masked Image Modelling for retinal OCT understanding

Theodoros Pissas; Pablo Márquez-Neila; Sebastian Wolf; Martin Zinkernagel; Raphael Sznitman

TL;DR

Masked image modelling with MAE is applied to retinal OCT to learn generalizable representations from a large, diverse real-world dataset. The authors extend MAE to a multimodal setting by pairing OCT with infrared fundus images, enabling joint representation learning and improved multimodal task performance. Through extensive evaluation on six OCT tasks and a multimodal diagnosis task, the approach outperforms baselines including ImageNet-pretrained, RETFound, and DINOv2 in both finetuning and linear probing, and demonstrates robustness across scanner types. The work provides public data splits, code, and model weights, highlighting the value of large-scale self-supervised pretraining and multimodal fusion for practical OCT understanding.

Abstract

This work explores the effectiveness of masked image modelling for learning representations of retinal OCT images. To this end, we leverage Masked Autoencoders (MAE), a simple and scalable method for self-supervised learning, to obtain a powerful and general representation for OCT images by training on 700K OCT images from 41K patients collected under real-world clinical settings. We also provide the first extensive evaluation of an OCT model on a challenging battery of 6 downstream tasks. Our model achieves strong performance when fully finetuned, but can also serve as a versatile frozen feature extractor for many tasks using lightweight adapters. Furthermore, we propose an extension of MAE pretraining that fuses OCT with an auxiliary modality, namely IR fundus images, and learns a joint model for both. We demonstrate that our approach improves performance on a multimodal downstream application. Our experiments utilize most publicly available OCT datasets, thus enabling future comparisons. Our code and model weights are publicly available at https://github.com/TheoPis/MIM_OCT.
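For readers unfamiliar with masked image modelling, the random masking step that MAE-style pretraining relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_patch_mask`, the 75% mask ratio (the default in the original MAE work, not a value stated here), and the 14x14 patch grid are assumptions chosen only for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper: in MAE-style pretraining the encoder sees only
    the visible patches and the decoder reconstructs the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # uniform random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

With these assumed settings, only a quarter of the patches reach the encoder, which is what makes this style of pretraining comparatively cheap on large image collections.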

Paper Structure (16 sections, 2 equations, 3 figures, 9 tables)


Introduction
Method
Preliminaries
Unimodal masked image modelling
Multimodal masked image modelling
Experimental setup
Datasets and downstream tasks:
Implementation details:
Downstream task adaptation:
Baselines:
Experiments:
Results
Unimodal OCT Model
Multimodal model
Conclusion
...and 1 more sections

Figures (3)

Figure 1: Unimodal MAE and our proposed multimodal masked image modelling.
Figure 2: Examples of reconstructions by our model on unseen images from RETOUCH. The model's encoder, without any finetuning, combined with a lightweight feature pyramid network, produces fluid segmentations of higher quality than DINOv2 with the same approach.
Figure 3: Multimodal reconstruction examples. OCT and IR images are encoded by our joint multimodal encoder and decoded by modality specific decoders.
