L-MAE: Longitudinal masked auto-encoder with time and severity-aware encoding for diabetic retinopathy progression prediction
Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Alireza Rezaei, Hugo Le Boité, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
TL;DR
This work tackles predicting diabetic retinopathy (DR) progression over multi-year horizons by leveraging a longitudinal masked auto-encoder (L-MAE) built on Vision Transformer principles. It introduces time-aware positional encoding and progression-aware masking to fuse irregular temporal information and disease dynamics into self-supervised pretraining, followed by fine-tuning for a 3-year severity prediction task on the OPHDIAT dataset. Empirical results show that incorporating temporal information and clinically informed masking substantially boosts predictive performance, particularly for severe progression, compared with standard MAE and longitudinal baselines. The approach offers a practical pathway to learn strong, transferable representations from longitudinal retinal images, with potential to inform personalized screening intervals and extend to other retinal diseases and modalities. Temporal embeddings are explicitly used to capture disease evolution, supporting time-aware risk stratification in ophthalmology.
Abstract
Pre-training strategies based on self-supervised learning (SSL) have proven effective for many downstream tasks in computer vision. Due to the significant disparity between medical and natural images, the application of typical SSL is not straightforward in medical imaging. Additionally, typical pretext tasks often lack context, which is critical for computer-aided clinical decision support. In this paper, we developed a longitudinal masked auto-encoder (MAE) based on the well-known Transformer-based MAE. In particular, we explored the importance of time-aware position embedding as well as disease progression-aware masking. Accounting for the time elapsed between examinations, rather than only their order, helps capture temporal changes and trends. The masking strategy, for its part, evolves during follow-up to better capture pathological changes, ensuring a more accurate assessment of disease progression. Using OPHDIAT, a large follow-up screening dataset targeting diabetic retinopathy (DR), we evaluated the pre-trained weights on a longitudinal task, which is to predict the severity label of the next visit within 3 years based on the past time series examinations. Our results demonstrated the relevance of both time-aware position embedding and masking strategies based on disease progression knowledge. Compared to popular baseline models and standard longitudinal Transformers, these simple yet effective extensions significantly enhance the predictive ability of deep classification models.
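To illustrate the idea of a time-aware position embedding, the sketch below replaces the integer visit index of a standard sinusoidal positional encoding with the elapsed time of each examination. This is only a minimal illustration under stated assumptions: the abstract does not specify the exact form of the embedding, so the sinusoidal formulation, the function name `time_aware_encoding`, the `max_period` parameter, and the example visit times are all hypothetical.

```python
import numpy as np


def time_aware_encoding(times, dim, max_period=10000.0):
    """Sinusoidal encoding indexed by elapsed time instead of visit order.

    Assumption: this mirrors the standard Transformer positional encoding,
    with the integer position replaced by a real-valued time stamp.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    t = np.asarray(times, dtype=np.float64)[:, None]    # (n_visits, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]  # (1, dim / 2)
    freqs = 1.0 / (max_period ** (2.0 * i / dim))       # one frequency per sin/cos pair
    angles = t * freqs                                  # (n_visits, dim / 2)
    enc = np.empty((t.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)  # even channels
    enc[:, 1::2] = np.cos(angles)  # odd channels
    return enc


# Hypothetical, irregularly spaced visits (months since the first exam).
visit_months = [0.0, 11.0, 26.0]
pe = time_aware_encoding(visit_months, dim=8)
```

With regularly spaced times such as `[0, 1, 2]`, this reduces to the usual order-based encoding; with irregular intervals, visits that are further apart in time receive more distinct encodings, which is the property the abstract attributes to time awareness.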
