Multi-scale Spatio-temporal Transformer-based Imbalanced Longitudinal Learning for Glaucoma Forecasting from Irregular Time Series Images

Xikai Yang; Jian Wu; Xi Wang; Yuchen Yuan; Ning Li Wang; Pheng-Ann Heng

TL;DR

This work tackles glaucoma forecasting from irregularly sampled longitudinal fundus images under severe class imbalance. It introduces MST-former, a multi-scale spatio-temporal transformer that uses space-time positional encoding, time-aware multi-head attention, and a scale-hierarchical encoder-decoder to jointly model spatial regions within images and disease progression over time. A temperature-controlled Balanced Softmax Cross-entropy loss mitigates the heavy label imbalance and allows end-to-end training. The method achieves a state-of-the-art AUC of 0.986 on SIGF and generalizes well to ADNI MRI data (90.3% accuracy), with ablations confirming the value of the space-time positional encoding (STP), time-aware temporal attention (TTA), and multi-scale (MS) components. These results suggest that MST-former offers a robust framework for longitudinal medical image forecasting with irregular sampling, with potential for multi-modal extensions and clinical impact.

Abstract

Glaucoma is a major eye disease that causes progressive optic nerve fiber damage and irreversible blindness, afflicting millions of individuals. Glaucoma forecasting supports early screening and intervention for at-risk patients, which helps to prevent further deterioration of the disease. It uses a series of historical fundus images of an eye to forecast the likelihood of future glaucoma occurrence. However, irregular sampling and an imbalanced class distribution are two challenges in developing disease forecasting approaches. To this end, we introduce the Multi-scale Spatio-temporal Transformer Network (MST-former), a transformer architecture tailored for sequential image inputs that effectively learns representative semantic information along both the temporal and spatial dimensions. Specifically, we employ a multi-scale structure to extract features at various resolutions, which exploits the rich spatial information encoded in each image. In addition, we design a time distance matrix that scales temporal attention in a non-linear manner to handle irregularly sampled data. Furthermore, we introduce a temperature-controlled Balanced Softmax Cross-entropy loss to address the class imbalance issue. Extensive experiments on the Sequential fundus Images for Glaucoma Forecast (SIGF) dataset demonstrate the superiority of the proposed MST-former, which achieves an AUC of 98.6% for glaucoma forecasting. Our method also generalizes well to the Alzheimer's Disease Neuroimaging Initiative (ADNI) MRI dataset, reaching an accuracy of 90.3% for mild cognitive impairment and Alzheimer's disease prediction and outperforming the compared method by a large margin.
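To make the class-imbalance idea concrete, the following is a minimal NumPy sketch of a Balanced Softmax cross-entropy with a temperature knob. The abstract does not specify where the temperature enters, so this sketch assumes it scales the log-prior term added to the logits; the function name `balanced_softmax_ce` and the parameters `class_counts` and `temperature` are illustrative and are not taken from the paper.

```python
import numpy as np


def balanced_softmax_ce(logits, labels, class_counts, temperature=1.0):
    """Cross-entropy on logits shifted by a temperature-scaled log class prior.

    Assumption: the temperature multiplies the log-prior term. The paper's
    exact formulation may differ.
    """
    logits = np.asarray(logits, dtype=float)        # shape (batch, classes)
    labels = np.asarray(labels)                     # shape (batch,)
    prior = np.asarray(class_counts, dtype=float)
    prior = prior / prior.sum()                     # empirical class frequencies

    # Shift each logit by the (scaled) log frequency of its class.
    adjusted = logits + temperature * np.log(prior)

    # Numerically stable log-softmax over the class axis.
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))

    # Mean negative log-likelihood of the true labels.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


if __name__ == "__main__":
    # Hypothetical 90/10 class split with an uninformative prediction.
    print(balanced_softmax_ce([[0.0, 0.0]], [1], [90, 10], temperature=1.0))
    print(balanced_softmax_ce([[0.0, 0.0]], [1], [90, 10], temperature=0.0))
```

Under this assumption, `temperature=0` reduces the sketch to ordinary softmax cross-entropy, while larger values apply a stronger prior correction, so minority-class samples incur a larger loss unless the model raises their logits.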

Paper Structure (26 sections, 6 equations, 7 figures, 7 tables)

Introduction
Related Works
Early Disease Forecast
Transformer with Spatial-temporal Representation
Methodology
Vanilla Transformer
MST-former Framework
Patch Embedding and Space-time Positional Encoding
Multi-head Spatial-Temporal Attention
Multi-scale Encoder-decoder Architecture
Balanced Softmax Cross-entropy Loss with Temperature Control
Experiments
Dataset and Evaluation Metrics
SIGF dataset
ADNI dataset
...and 11 more sections

Figures (7)

Figure 1: Examples of the sequential fundus images in the SIGF database. The upper panel shows a time-invariant sequence, in which the patient's eye remains negative at all time points. The lower panel shows a time-variant sequence, in which the patient's eye converts from normal to glaucoma on 2002/09/17.
Figure 2: Architecture of the conventional transformer.
Figure 3: Illustration of the proposed multi-scale spatio-temporal transformer network (MST-former), which includes $3$ scales. Within each scale, there are $N$ encoder and decoder blocks. The input of the encoder comprises the patch embedding and the space-time positional encoding, while the input of the decoder contains the output embedding together with its positional embedding. The circled-plus symbol ($\oplus$) denotes element-wise addition.
Figure 4: Spatial-temporal self-attention block in MST-former. Two-level self-attention steps (spatial self-attention and time-aware temporal self-attention) are included to calculate attention scores along the space and time dimensions.
Figure 5: Sample illustration of the multi-scale structure of MST-former. Here, we present $3$ scales. During each scale transition, tokens that are topologically adjacent in a $2 \times 2$ neighborhood are merged together. Note that this diagram depicts only a single image. When the input is a sequence of medical images, the same operations are performed in parallel for each image in the sequence.
...and 2 more figures
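
The scale transition described in the Figure 5 caption can be sketched in a few lines of NumPy. The caption says only that $2 \times 2$ adjacent tokens are "merged"; this sketch assumes merging by channel concatenation (one common choice in hierarchical vision transformers), and the function name `merge_2x2` and the `(time, height, width, channels)` layout are illustrative assumptions rather than details from the paper.

```python
import numpy as np


def merge_2x2(tokens):
    """Merge each 2x2 block of neighboring tokens into a single token.

    tokens: array of shape (time, height, width, channels), one token grid
    per image in the sequence. Assumption: merging concatenates the four
    token vectors, giving shape (time, height/2, width/2, 4 * channels).
    """
    t, h, w, c = tokens.shape
    if h % 2 or w % 2:
        raise ValueError("token grid height and width must be even")

    # Split each spatial axis into (block index, position inside block).
    x = tokens.reshape(t, h // 2, 2, w // 2, 2, c)
    # Bring the two in-block axes next to the channel axis.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Flatten each 2x2 block and its channels into one vector.
    return x.reshape(t, h // 2, w // 2, 4 * c)


if __name__ == "__main__":
    # Hypothetical sequence of 5 images, each an 8x8 grid of 16-dim tokens.
    scale1 = np.zeros((5, 8, 8, 16))
    scale2 = merge_2x2(scale1)   # every image is merged independently
    scale3 = merge_2x2(scale2)
    print(scale1.shape, scale2.shape, scale3.shape)
```

Because the time axis is never mixed with the spatial axes, each image in the sequence is handled independently, which matches the caption's note that the same operation runs in parallel for every image. Two successive merges yield the three scales shown in the figure.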
