L-MAE: Longitudinal masked auto-encoder with time and severity-aware encoding for diabetic retinopathy progression prediction

Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Alireza Rezaei, Hugo Le Boité, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard

TL;DR

This work tackles predicting diabetic retinopathy (DR) progression over multi-year horizons by leveraging a longitudinal masked auto-encoder (L-MAE) built on Vision Transformer principles. It introduces time-aware positional encoding and progression-aware masking to fuse irregular temporal information and disease dynamics into self-supervised pretraining, followed by fine-tuning for a 3-year severity prediction task on the OPHDIAT dataset. Empirical results show that incorporating temporal information and clinically informed masking substantially boosts predictive performance, particularly for severe progression, compared with standard MAE and longitudinal baselines. The approach offers a practical pathway to learn strong, transferable representations from longitudinal retinal images, with potential to inform personalized screening intervals and extend to other retinal diseases and modalities. $L$-value notation and temporal embeddings are explicitly leveraged to capture disease evolution, enabling robust, time-aware risk stratification in ophthalmology.

Abstract

Pre-training strategies based on self-supervised learning (SSL) have proven to be effective pretext tasks for many downstream tasks in computer vision. Due to the significant disparity between medical and natural images, the application of typical SSL is not straightforward in medical imaging. Additionally, those pretext tasks often lack context, which is critical for computer-aided clinical decision support. In this paper, we developed a longitudinal masked auto-encoder (MAE) based on the well-known Transformer-based MAE. In particular, we explored the importance of time-aware position embedding as well as disease progression-aware masking. Taking into account the time between examinations instead of just scheduling them offers the benefit of capturing temporal changes and trends. The masking strategy, for its part, evolves during follow-up to better capture pathological changes, ensuring a more accurate assessment of disease progression. Using OPHDIAT, a large follow-up screening dataset targeting diabetic retinopathy (DR), we evaluated the pre-trained weights on a longitudinal task, which is to predict the severity label of the next visit within 3 years based on the past time series examinations. Our results demonstrated the relevancy of both time-aware position embedding and masking strategies based on disease progression knowledge. Compared to popular baseline models and standard longitudinal Transformers, these simple yet effective extensions significantly enhance the predictive ability of deep classification models.
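
To make the idea of time-aware position embedding concrete, here is a minimal sketch, not the authors' implementation. It assumes a standard sinusoidal encoding evaluated at the elapsed time of each examination instead of at its rank in the sequence. The function name `time_aware_encoding`, the choice of years as the time unit, and the parameters `dim` and `max_period` are all illustrative assumptions.

```python
import math

def time_aware_encoding(times_in_years, dim=8, max_period=10000.0):
    """Sinusoidal encoding evaluated at real elapsed times (hypothetical helper).

    A conventional encoding would be called with visit indices 0, 1, 2, ...;
    here the same formula receives the time elapsed since the first visit.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encodings = []
    for t in times_in_years:
        row = []
        for i in range(dim // 2):
            # Frequencies decrease geometrically, as in the usual Transformer encoding.
            freq = 1.0 / (max_period ** (2 * i / dim))
            row.append(math.sin(t * freq))
            row.append(math.cos(t * freq))
        encodings.append(row)
    return encodings

# Three visits at irregular intervals (assumed unit: years since the first examination).
visit_times = [0.0, 0.5, 2.5]
index_based = time_aware_encoding([0, 1, 2])   # encodes order only
time_based = time_aware_encoding(visit_times)  # encodes order and elapsed time
```

With index-based positions, a visit six months after the first and a visit two years after it would receive the same encoding whenever they share the same rank. With elapsed times as input, the two cases are distinguishable.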

Paper Structure (19 sections, 7 equations, 5 figures, 5 tables)

This paper contains 19 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Diabetic retinopathy progression stages. The number and type of anomalies increase with severity. As severity increases, we observe more anomalies in peripheral zones.
  • Figure 2: Example of position encoding along the time dimension. Conventional position encoding captures only the order of the elements and disregards the time intervals between them. Our masking strategies, which are sensitive to disease progression dynamics, incorporate time-dependent information so that the model can capture those dynamics.
  • Figure 3: Illustration of our proposed progression-aware masking strategies.
  • Figure 4: Illustration of the ViViT. Along the temporal dimension, our time-aware position encoding replaces the regular position encoding.
  • Figure 5: Illustration of our longitudinal masked auto-encoder. It differs from the classic Video-MAE in two respects. First, in the embedding layers, we add our proposed time-aware position embedding to the classical position embeddings used in the Transformer. Second, the masking strategies differ.
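
The captions above name progression-aware masking without giving its rule, so the following is only a hypothetical sketch of what a severity-dependent mask could look like. It builds on the observation in Figure 1 that anomalies become more frequent in peripheral zones as severity increases. It assumes integer severity grades from 0 to `max_severity`, a square patch grid, and an illustrative weighting (`1 + 4 * alpha * dist`) that biases masking toward the periphery. None of these choices come from the paper.

```python
import math
import random

def progression_aware_mask(grid_size, severity, mask_ratio=0.75, max_severity=4, seed=0):
    """Return a grid_size x grid_size boolean mask (True = patch hidden).

    Hypothetical rule: at grade 0 patches are masked uniformly; as the grade
    rises, peripheral patches become more likely to be masked.
    """
    rng = random.Random(seed)
    center = (grid_size - 1) / 2.0
    max_dist = math.hypot(center, center) or 1.0
    alpha = severity / max_severity  # 0 = uniform, 1 = strongest peripheral bias

    keyed = []
    for r in range(grid_size):
        for c in range(grid_size):
            dist = math.hypot(r - center, c - center) / max_dist  # in [0, 1]
            weight = 1.0 + 4.0 * alpha * dist  # illustrative weighting, an assumption
            # Weighted sampling without replacement: keep the largest u ** (1 / w) keys.
            key = rng.random() ** (1.0 / weight)
            keyed.append((key, r, c))

    n_masked = round(mask_ratio * grid_size * grid_size)
    keyed.sort(reverse=True)
    mask = [[False] * grid_size for _ in range(grid_size)]
    for _, r, c in keyed[:n_masked]:
        mask[r][c] = True
    return mask

# A 14 x 14 patch grid for a visit graded 3 on the assumed 0-4 scale.
mask = progression_aware_mask(grid_size=14, severity=3)
n_hidden = sum(sum(row) for row in mask)  # number of hidden patches
```

The number of hidden patches is fixed by `mask_ratio`. Only their spatial distribution changes with the grade, which keeps the reconstruction task comparably hard across visits in this sketch.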